\cftsetindents

subsection2em3em \DeclareNewFootnoteA[fnsymbol]

A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models
《基于大型语言模型的推理可信度综合调查》

Yanbo Wang School of Artificial Intelligence, University of Chinese Academy of Sciences NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences Yongcan Yu School of Artificial Intelligence, University of Chinese Academy of Sciences NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences Jian Liang School of Artificial Intelligence, University of Chinese Academy of Sciences NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences Ran He School of Artificial Intelligence, University of Chinese Academy of Sciences NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences

Abstract 摘要

Abstract: The development of Long-CoT reasoning has advanced LLM performance across various tasks, including language understanding, complex problem solving, and code generation. This paradigm enables models to generate intermediate reasoning steps, thereby improving both accuracy and interpretability. However, despite these advancements, a comprehensive understanding of how CoT-based reasoning affects the trustworthiness of language models remains underdeveloped. In this paper, we survey recent work on reasoning models and CoT techniques, focusing on five core dimensions of trustworthy reasoning: truthfulness, safety, robustness, fairness, and privacy. For each aspect, we provide a clear and structured overview of recent studies in chronological order, along with detailed analyses of their methodologies, findings, and limitations. Future research directions are also appended at the end for reference and discussion. Overall, while reasoning techniques hold promise for enhancing model trustworthiness through hallucination mitigation, harmful content detection, and robustness improvement, cutting-edge reasoning models themselves often suffer from comparable or even greater vulnerabilities in safety, robustness, and privacy. By synthesizing these insights, we hope this work serves as a valuable and timely resource for the AI safety community to stay informed on the latest progress in reasoning trustworthiness. A full list of related papers can be found at https://github.com/ybwang119/Awesome-reasoning-safety.
摘要：长 CoT 推理的发展提升了 LLM 在语言理解、复杂问题解决和代码生成等任务上的表现。该范式使模型能够生成中间推理步骤，从而提高准确性和可解释性。然而，尽管取得了这些进展，对基于 CoT 的推理如何影响语言模型可信度的全面理解仍不充分。在本文中，我们调查了推理模型和 CoT 技术方面的近期研究，重点关注可信推理的五个核心维度：真实性、安全性、鲁棒性、公平性和隐私性。对于每个方面，我们按时间顺序提供了近期研究的清晰、结构化的概述，并对其方法、发现和局限性进行了详细分析。文末还附有未来研究方向供参考和讨论。总体而言，虽然推理技术通过减少幻觉、检测有害内容和提高鲁棒性来提升模型可信度具有潜力，但最先进的推理模型本身在安全、鲁棒性和隐私方面往往存在相当甚至更大的漏洞。通过综合这些见解，我们希望这项工作能为 AI 安全社区提供一个宝贵且及时的资源，以了解推理可信度的最新进展。相关论文的完整列表可以在 https://github.上找到。com/ybwang119/Awesome-reasoning-safety.

\footnotetextA

[3]Corresponding author: Jian Liang (liangjian92@gmail.com). \footnotetextA[4]This survey considers papers published up to June 30, 2025. Work in progress.
[3]通讯作者：梁健（liangjian92@gmail.com）。\footnotetextA[4]本综述考虑了截至 2025 年 6 月 30 日发表的论文。工作进行中。

1 Introduction 1 引言

With the advancement of large language models (LLMs), Chain-of-Thought (CoT) techniques have become an important way to improve model performance on various downstream tasks, especially in math and code generation. After the release of OpenAI’s o1 series models as well as the DeepSeek-R1, developing reasoning models with system-2 thinking also attracted significant interest from researchers around the world, followed by innovations in reinforcement learning algorithms, training data generation, and adaptation methods for other tasks.
随着大型语言模型（LLMs）的进步，思维链（CoT）技术已成为提高模型在各类下游任务上性能的重要途径，特别是在数学和代码生成领域。在 OpenAI 的 o1 系列模型以及 DeepSeek-R1 发布后，开发具有系统-2 思维的推理模型也吸引了全球研究人员的广泛关注，随后在强化学习算法、训练数据生成以及其他任务的适应方法上出现了创新。

Despite these improvements, the trustworthiness of CoT techniques as well as reasoning models remains underexplored. Intuitively, it may be reasonable that the thinking capability could be generalized to the trustworthiness domain, resulting in a safer and more reliable model. However, recent works [1, 2, 3] did not support such an ideal hypothesis. Furthermore, prior surveys on LLM safety [4, 5, 6] provide little discussion of reasoning as a factor in model trustworthiness. This gap motivates the central question: What does the reasoning capability bring to the language model trustworthiness?
尽管取得了这些改进，但 CoT 技术以及推理模型的可靠性仍待深入探索。直观上，思维能力可能可以泛化到可靠性领域，从而产生更安全、更可靠的模型。然而，最近的研究工作[ 1, 2, 3]并未支持这种理想假设。此外，关于 LLM 安全的先验综述[ 4, 5, 6]对推理作为模型可靠性因素讨论甚少。这一空白促使了核心问题：推理能力为语言模型可靠性带来了什么？

To answer this question, we propose the first comprehensive survey to thoroughly review recent advancements in trustworthy reasoning. We unfold our survey through five main components: truthfulness, safety, robustness, fairness, and privacy. In the truthfulness section, with a focus on model reliability, we include hallucination and reasoning faithfulness, encompassing hallucination detection and mitigation methods with CoT techniques, hallucination analysis in reasoning models, reasoning faithfulness measurement, faithfulness understanding, as well as methods to improve reasoning faithfulness. In the safety section, we aim to understand the harmlessness of the generation content, and mainly take vulnerability assessment, jailbreak, alignment, and backdoor into consideration. For better readability, we specifically distinguish between jailbreak attacks targeting reasoning models and the use of reasoning techniques in attack and defense, forming different paragraphs to structure the literature. In the robustness section, we mainly focus on adversarial input noises that elicit false answers at inference time. The overthinking and underthinking problems are highlighted as a special case when language models are equipped with reasoning capability. After that, in the fairness section, we mainly cover the latest evaluations and methods for bias detection. As for the privacy section, we split the related works into model-related privacy and prompt-related privacy, with topics containing model unlearning, IP protection, watermarking, and privacy inference.
为了回答这个问题，我们提出了首个全面综述，旨在深入探讨可信推理的最新进展。我们的综述通过五个主要组成部分展开：真实性、安全性、鲁棒性、公平性和隐私性。在真实性部分，我们重点关注模型可靠性，包括幻觉和推理忠实性，涵盖使用思维链技术的幻觉检测与缓解方法、推理模型中的幻觉分析、推理忠实性度量、忠实性理解，以及提高推理忠实性的方法。在安全性部分，我们旨在理解生成内容的无害性，主要考虑漏洞评估、越狱攻击、对齐和后门。为了提高可读性，我们特别区分了针对推理模型的越狱攻击和攻击与防御中使用推理技术的情况，将文献分成不同段落进行结构化。在鲁棒性部分，我们主要关注在推理时导致错误答案的对抗性输入噪声。当语言模型具备推理能力时，过度思考和思考不足问题被突出为一种特殊情况。随后，在公平性部分，我们主要涵盖了最新的偏见检测评估和方法。至于隐私部分，我们将相关研究工作分为模型相关隐私和提示相关隐私，主题包括模型去学习、知识产权保护、水印和隐私推理。

While existing surveys have explored reasoning techniques [7, 8] and reasoning efficiency [9, 10, 11], relatively little attention has been paid to the trustworthiness of reasoning in large language models. A related survey [12] provided valuable discussions on safety-related aspects. In contrast, our work offers a more comprehensive perspective on trustworthiness. In general, we provide a clear taxonomy for model trustworthiness in reasoning, which includes both early CoT techniques and end-to-end reasoning models. Through our review of existing work, we suggest that reasoning techniques not only facilitate the development of more interpretable and trustworthy models but also introduce new vulnerabilities. As models acquire more advanced reasoning capabilities, the attack surface correspondingly expands, enabling more complex and targeted adversarial strategies. We hope that both the surveyed literature and our proposed taxonomy will serve as a timely reference for the AI safety community, supporting ongoing efforts to understand and improve the trustworthiness of reasoning in language models.
尽管现有综述已探讨了推理技术[7, 8]和推理效率[9, 10, 11]，但相对较少关注大型语言模型中推理的可信度。一项相关综述[12]对安全相关方面提供了有价值的讨论。相比之下，我们的工作提供了更全面的可信度视角。总体而言，我们为推理中的模型可信度提供了一个清晰的分类体系，包括早期的 CoT 技术和端到端推理模型。通过我们对现有工作的综述，我们提出推理技术不仅促进了更可解释和可信度更高的模型开发，还引入了新的漏洞。随着模型获得更高级的推理能力，攻击面相应扩大，使得更复杂和有针对性的对抗策略成为可能。我们希望所调查的文献和我们所提出的分类体系能为人工智能安全社区提供及时的参考，支持其持续努力理解和提升语言模型中推理的可信度。

Table 1: List of Abbreviations and Acronyms
表 1：缩写词和首字母缩略词列表

Abbreviation 缩写词	Full Term 完整术语	Abbreviation 缩写词	Full Term 完整术语
AOC	Area Over Curve 曲线下面积	MCTS	Monte-Carlo Tree Search 蒙特卡洛树搜索
ASR	Attack Success Rate 攻击成功率	MLLM	Multimodal Large Language Model 多模态大语言模型
CNN	Convolutional Neural Network 卷积神经网络	MLRM	Multimodal Large Reasoning Model 多模态大推理模型
CoT	Chain-of-Thought 思维链	ORM	Outcome Reward Model 结果奖励模型
DFS	Depth-First Search 深度优先搜索	PRM	Process Reward Model 过程奖励模型
DPO	Direct Preference Optimization 直接偏好优化	QA	Question-Answering 问答
GRPO	Group Relative Policy Optimization 组相对策略优化	RL	Reinforcement Learning 强化学习
ICL	In-Context Learning 情境学习	RLHF	Reinforcement Learning from Human Feedback 人类反馈强化学习
KL	Kullback-Leibler Divergence Kullback-Leibler 散度	RLVR	Reinforcement Learning with Verifiable Reward 可验证奖励强化学习
LAS	Leakage-Adjusted Simulatability 泄漏调整模拟性	RAG	Retrieval-Augmented Generation 检索增强生成
LLM	Large Language Model 大型语言模型	SCM	Structural Causal Model 结构因果模型
LRM	Large Reasoning Model 大型推理模型	SoTA	State-of-the-Art 最先进技术
LoRA	Low-Rank Adapter 低秩适配器	SFT	Supervised Fine-Tuning 监督微调
MoE	Mixture-of-Experts 专家混合模型	VR	Verifiable Reward 可验证奖励

2 Background 2 背景

In this section, we provide an overview of fundamental concepts related to reasoning in language models, including discussions of the general definition of reasoning, an introduction to CoT as a widely adopted technique, and key considerations in model training that influence the reasoning abilities.
在本节中，我们概述了与语言模型推理相关的基本概念，包括对推理的一般定义的讨论、作为广泛采用技术的思维链（CoT）的介绍，以及影响推理能力的关键模型训练考虑因素。

2.1 Large Language Model Reasoning
2.1 大型语言模型推理

LLM reasoning is a novel paradigm that leverages the knowledge embedded within models like GPT-4 [13], Claude [14], and DeepSeek-R1 [15] to solve complex tasks—such as math, coding, and logical reasoning—by mimicking human cognitive processes. Typically, LLM reasoning involves generating both the final answer and the intermediate steps, often referred to as “thoughts”, which guide the model from the question to the answer. Formally, given a prompt $x$ and context $C$ , the reasoning of an LLM $\mathcal{M}$ can be represented as follows:
LLM 推理是一种新范式，它利用嵌入在 GPT-4[13]、Claude[14]和 DeepSeek-R1[15]等模型中的知识，通过模仿人类认知过程来解决复杂任务——如数学、编程和逻辑推理。通常，LLM 推理会生成最终答案和中间步骤，这些中间步骤通常被称为“思考”，它们引导模型从问题到答案。形式上，给定一个提示 $x$ 和上下文 $C$ ，一个 LLM $\mathcal{M}$ 的推理可以表示如下：

T,A=\mathcal{M}(x,C),

(1)

where $T$ refers to the intermediate reasoning process and $A$ is the answer. By enabling the AI system to generate interpretable reasoning steps alongside the solution, LLM reasoning not only solves complex tasks but also improves human understanding of the problem-solving process, thereby enhancing its utility and reliability. Currently, the two main paradigms for implementing large language model reasoning are CoT prompting and large reasoning model training.
其中 $T$ 指代中间推理过程， $A$ 是答案。通过使 AI 系统能够在提供解决方案的同时生成可解释的推理步骤，LLM 推理不仅解决了复杂任务，还提高了人类对问题解决过程的理解，从而增强了其效用和可靠性。目前，实现大型语言模型推理的两种主要范式是 CoT 提示和大型推理模型训练。

2.2 Chain-of-Thought Prompting
2.2Chain-of-Thought 提示

CoT prompting [16, 17] is a prompt engineering technique designed to elicit a sequence of intermediate reasoning steps referred to as the thought, before providing the final answer. There are various methods for implementing CoT, with two of the most common being few-shot-CoT [16] and zero-shot-CoT [17]. As illustrated in Figure˜1, few-shot-CoT mirrors the approach of few-shot in-context learning (ICL) [18], utilizing a small number of examples to guide the model in answering questions. Unlike traditional ICL, few-shot-CoT [19] not only shows the answer in the demonstrations, but also gives the specific reasoning steps before the answer. Therefore, the model will also give CoT before answering the question. While few-shot-CoT demonstrates strong performance on complex tasks such as math and symbolic reasoning, it requires human-annotated, task-specific examples with intricate reasoning paths, limiting its applicability. In contrast, zero-shot-CoT [16] offers a more flexible, task-agnostic method for eliciting CoT by simply adding the prefix “Let’s think step by step” before generating the answer.
CoT 提示[16, 17]是一种提示工程技术，旨在提供最终答案之前，引导模型输出一系列中间推理步骤，这些步骤被称为思考过程。实现 CoT 的方法有多种，其中最常见的是少样本 CoT[16]和零样本 CoT[17]。如图 1 所示，少样本 CoT 模仿了少样本情境学习（ICL）[18]的方法，利用少量示例来引导模型回答问题。与传统的 ICL 不同，少样本 CoT[19]不仅会在示例中展示答案，还会在答案之前提供具体的推理步骤。因此，模型在回答问题时也会给出 CoT。虽然少样本 CoT 在数学和符号推理等复杂任务上表现出色，但它需要人工标注的、特定任务的示例，且推理路径复杂，限制了其适用性。相比之下，零样本 CoT[16]通过在生成答案前添加前缀“让我们一步步思考”，提供了一种更灵活、任务无关的方法来引导 CoT。

Refer to caption — Figure 1: Illustration of typical CoT prompting. Few-shot-CoT uses several examples with the reasoning process to elicit CoT, and zero-shot-CoT uses a prefix prompt to induce the reasoning process.
图 1：典型的思维链提示说明。Few-shot-CoT 使用带推理过程的几个例子来引出思维链，而 zero-shot-CoT 使用前缀提示来诱导推理过程。

2.3 Large Reasoning Models
2.3 大型推理模型

Large reasoning models (LRMs), represented by OpenAI o1 [20] and DeepSeek-R1 [15], refer to a series of large language models that explicitly generate their thinking process before filling the final answers [8]. Instead of prompting models to “think step by step”, reasoning models could automatically create the thinking process that mimics how humans analyze a problem.
大型推理模型（LRMs），以 OpenAI 的 o1 [ 20] 和 DeepSeek-R1 [ 15] 为代表，指的是一系列在填写最终答案之前明确生成其思考过程的大型语言模型 [ 8]。与提示模型“逐步思考”不同，推理模型可以自动创建模仿人类分析问题方式的思考过程。

2.3.1 Model Training 2.3.1 模型训练

There are a few open-source trials to replicate the o1 series [8], including OpenR [21], o1-journey [22, 23, 24], and LLaMA-Berry [25]. The key to the replication lies in distilling long CoT data, even if the source model has not been explicitly trained for reasoning. LLaMA-Berry [25] utilized Monte Carlo tree search (MCTS) [26] with a pairwise preference reward model to scale test-time compute, achieving a higher performance on multiple Math datasets such as GSM8k [27], MATH [28], GaoKao2023En [29], etc. O1-journey [22] utilized MCTS with a fine-grained reward model to construct long CoT data. After building the reasoning tree with each node annotated with a reward score indicating correctness, a traversal algorithm such as Depth-First Search (DFS) with constraints could be adopted to create a datapoint using an error-then-backtrack style. Supervised fine-tuning (SFT), followed by Direct Preference Optimization (DPO) [30], was then leveraged to train the reasoning model. OpenR [21] introduced reinforcement learning with a process reward model to encourage reasoning capability. During training, the LLM policy was updated at each reasoning step using intermediate step-wise rewards from the reward model, optimized with either the proximal policy optimization (PPO) [31] or the group relative policy optimization (GRPO) [32]. Except for these tree searching methods, DeepSeek-R1 demonstrated the outstanding performance of pure reinforcement learning in boosting reasoning capability, utilizing distilled data from R1-Zero¹¹1The model for long CoT data synthesis underwent preliminary supervised fine-tuning (cold start). Therefore, it is slightly different from the released R1-Zero model.
用于长 CoT 数据的合成模型经历了初步的监督微调（冷启动）。因此，它与发布的 R1-Zero 模型略有不同。
有一些开源项目尝试复制 o1 系列[ 8]，包括 OpenR [ 21]、o1-journey [ 22, 23, 24]和 LLaMA-Berry [ 25]。复制的关键在于蒸馏长 CoT 数据，即使源模型没有经过明确的推理训练。LLaMA-Berry [ 25]利用蒙特卡洛树搜索（MCTS）[ 26]和成对偏好奖励模型来扩展测试时的计算量，在多个数学数据集（如 GSM8k [ 27]、MATH [ 28]、GaoKao2023En [ 29]等）上取得了更高的性能。O1-journey [ 22]利用 MCTS 和细粒度奖励模型来构建长 CoT 数据。在用奖励分数标注正确性的推理树节点构建完成后，可以采用带约束的深度优先搜索（DFS）等遍历算法，以错误回溯的方式创建数据点。随后利用监督微调（SFT）和直接偏好优化（DPO）[ 30]来训练推理模型。OpenR [ 21]引入了过程奖励模型的强化学习，以鼓励推理能力。在训练过程中，LLM 策略在每个推理步骤中使用来自奖励模型的中间步骤奖励进行更新，优化方法采用近端策略优化（PPO）[31]或组相对策略优化（GRPO）[32]。除了这些树搜索方法，DeepSeek-R1 展示了纯强化学习在提升推理能力方面的卓越性能，利用 R1-Zero ¹ 的蒸馏数据 to train the base model. One point worth noting is that, except for latent reasoning models [33, 34], there is no obvious difference between previous chat models and current reasoning models in terms of model structure. In fact, all these models are developed based on well-trained chat models such as DeepSeek-V3 [35], Qwen2.5 [36], and Llama-3 series [37].
用于训练基础模型。有一点值得注意，除了潜推理模型[33, 34]之外，之前的聊天模型和当前的推理模型在模型结构上没有明显区别。事实上，所有这些模型都是基于 DeepSeek-V3[35]、Qwen2.5[36]和 Llama-3 系列[37]等训练良好的聊天模型开发的。

PRM, ORM, and VR. According to Uesato et al. [38], current reward models could be divided into two types: process reward model (PRM) and outcome reward model (ORM), in which the former provides stepwise reward on each reasoning process, and the latter simply gives one score for the whole generation sequence. Instead of ORM [39], Lightman et al. [40] proposed PRM to verify the thinking process step by step, and demonstrated its superior performance to ORM in providing more reliable step-wise reward. For inference-time scaling, these reward models could not only facilitate the tree search at inference time for better performance, but also help filter reasoning trajectories with higher quality for post-training. Before the release of DeepSeek-R1 [15], the training of reward models is crucial for reasoning model development. Verifiable reward (VR) was first proposed by Lambert et al. [41], which includes three types: correctness verification, verification via execution, and verifiable constraints [42]. Different from reward models, here we define verifiable reward as “the reward provided by a simple deterministic function instead of large models, which is objective, usually binary, and outcome-based”. DeepSeek-R1 demonstrates the effectiveness of VR, which is then regarded as a prevailing post-training method when combined with GRPO.
PRM、ORM 和 VR。根据 Uesato 等人[38]的研究，当前的奖励模型可以分为两种类型：过程奖励模型（PRM）和结果奖励模型（ORM），前者在推理过程中每一步都提供奖励，后者则对整个生成序列给出一个分数。与 ORM[39]不同，Lightman 等人[40]提出了 PRM 来逐步验证思考过程，并证明了它在提供更可靠的逐步奖励方面优于 ORM。在推理时扩展方面，这些奖励模型不仅可以促进推理时的树搜索以获得更好的性能，还可以帮助筛选出用于后训练的高质量推理轨迹。在 DeepSeek-R1[15]发布之前，奖励模型的训练对于推理模型开发至关重要。可验证奖励（VR）最早由 Lambert 等人[41]提出，包括三种类型：正确性验证、执行验证和可验证约束[42]。与奖励模型不同，在这里我们将可验证奖励定义为“由简单的确定性函数提供的奖励，而不是大型模型，它是客观的、通常是二元的、基于结果的”。DeepSeek-R1 展示了 VR 的有效性，当与 GRPO 结合时，它被视为一种流行的后训练方法。

2.3.2 Multimodal LRM 2.3.2 多模态 LRM

Li et al. [43] summarized the development of multimodal large reasoning models (MLRMs) into three stages: “perception driven modular reasoning”, “language-centric short reasoning”, and “language-centric long reasoning”. Like the development of unimodal large reasoning models, MLRMs also experienced the transformation from zero-shot or few-shot CoT prompting to long reasoning data post-training [44]. For example, Multimodal-CoT [45], VoT [46], and VIC [47] are some of the early works that focused on the prompting to elicit model thinking. In terms of training, LLaVA-CoT [48], Llamav-o1 [49], RedStar [50], and Mulberry [51] propose to empower multimodal large language models (MLLMs) with reasoning capabilities by finetuning base models. As stated in Section˜2.3.1, multimodal CoT data generation is also crucial for model training, and the construction of the reasoning path includes distillation [48, 49, 52, 53, 54] or MCTS [51, 55], which also resembles the way mentioned for text-domain CoT data generation.
李等人[43]将多模态大推理模型（MLRM）的发展总结为三个阶段：“感知驱动模块化推理”、“语言中心化短推理”和“语言中心化长推理”。与单模态大推理模型的发展类似，MLRM 也经历了从零样本或少样本 CoT 提示到训练后长推理数据的转变[44]。例如，Multimodal-CoT[45]、VoT[46]和 VIC[47]是一些早期专注于提示以引出模型思考的工作。在训练方面，LLaVA-CoT[48]、Llamav-o1[49]、RedStar[50]和 Mulberry[51]提出通过微调基础模型来赋予多模态大语言模型（MLLM）推理能力。如第 2.3.1 节所述，多模态 CoT 数据生成对于模型训练也至关重要，推理路径的构建包括蒸馏[48,49,52,53,54]或 MCTS[51,55]，这也类似于文本域 CoT 数据生成的方式。

As for model training, pure GRPO and SFT followed by GRPO become the prevailing method for reasoning model development [44], which may be attributed to the outstanding performance of RL demonstrated by DeepSeek-R1.
关于模型训练，纯 GROPO 和 SFT 随后再进行 GROPO 成为推理模型开发的主流方法[44]，这可以归因于 DeepSeek-R1 所展示的强化学习卓越性能。

{forest}

forked edges, for tree= grow=east, reversed=true, anchor=base west, parent anchor=east, child anchor=west, base=center, font=, rectangle, draw=hidden-draw, rounded corners, align=left, minimum width=4em, edge+=darkgray, line width=1pt, s sep=3pt, inner xsep=2pt, inner ysep=3pt, line width=0.8pt, ver/.style=rotate=90, child anchor=north, parent anchor=south, anchor=center, , where level=1text width=6em,align=center,font=,, where level=2text width=8.2em,align=center,font=,, where level=3text width=8.2em,align=center,font=,, where level=4text width=8.2em, font=,, [ Trustworthiness in language model reasoning, ver [ Truthfulness
分叉边，对于树= grow=east, reversed=true, anchor=base west, parent anchor=east, child anchor=west, base=center, font=, rectangle, draw=hidden-draw, rounded corners, align=left, minimum width=4em, edge+=darkgray, line width=1pt, s sep=3pt, inner xsep=2pt, inner ysep=3pt, line width=0.8pt, ver/.style=rotate=90, child anchor=north, parent anchor=south, anchor=center, , where level=1text width=6em,align=center,font=,, where level=2text width=8.2em,align=center,font=,, where level=3text width=8.2em,align=center,font=,, where level=4text width=8.2em, font=,, [ 语言模型推理中的可信度，ver [ 真实性
(§3), my-box1 [ Hallucination
(§3), my-box1 [ 虚构
(§3.1), my-box1 [ Hallucination with
(§3.1), 我的盒子 1 [幻觉与
reasoning techniques, my-box2 [Multimodal-CoT [45], HaluSearch [56], HalluMeasure [57],
推理技术，我的盒子 2 [多模态思维链 [45]，HaluSearch [56]，HalluMeasure [57]，
CLATTER [58], Reflexive Prompting [59], GCoT [60], CoMT [61], leaf, text width=26em ] ] [ Hallucination of
CLATTER [ 58], 反射式提示 [ 59], GCoT [ 60], CoMT [ 61], 叶片, 文本宽度=26em ] ] [ 幻觉
reasoning models, my-box2 [ MIRAGE [62], SUM [63], RH-Bench [64], Yao et al. [65],
推理模型，我的盒子 2 [ MIRAGE [ 62], SUM [ 63], RH-Bench [ 64], Yao 等人 [ 65],
VIC [47], AbstentionBench [66], Lu et al. [67], FSPO [68],
Anh et al. [69], GRPO-R [70], RFMDataset [71], FG-PRM [72],
Anh 等人[69]，GRPO-R[70]，RFMDataset[71]，FG-PRM[72]，
Zhang et al. [73], RACE [74] , leaf, text width=26em ] ] ] [ Faithfulness of
张等人[73]、RACE[74]、leaf，text width=26em] ] ] [ 忠实度
Reasoning Models   推理模型
(§3.2), my-box1 [ Measuring &
(§3.2), 我的盒子 1 [ 测量 &
understanding, my-box1 [ Lanham et al. [75], Turpin et al. [76], PFF [77], Xiong et al. [78],
理解, 我的盒子 1 [ 兰哈姆等[ 75], 特平等[ 76], PFF[ 77], 熊等[ 78],
Bentham et al. [79], Arcuschin et al. [80], Chua et al. [81],
本瑟姆等[ 79], 阿尔库什因等[ 80], 茹阿等[ 81],
Chen et al. [82], Li et al. [83], Agarwal et al. [84], Bao et al. [85],
陈等[ 82], 李等[ 83], 阿格拉瓦尔等[ 84], 包等[ 85],
Tanneru et al. [86], Lobo et al. [87] , leaf, text width=26em ] ] [ Faithfulness
Tanneru 等[86], Lobo 等[87], leaf, text width=26em] ] [ 忠实性
improvement, my-box1 [ FRODO [88], SymbCoT [89], Radhakrishnan et al. [90],
改进, my-box1 [ FRODO[88], SymbCoT[89], Radhakrishnan 等[90],
Faithful CoT [91], LOGIC-LM [92], FLARE [93], CoMAT [94],
忠实 CoT[91], LOGIC-LM[92], FLARE[93], CoMAT[94],
CORE [95], QUIRE [83], Fact [96], Viteri et al. [97], leaf, text width=26em ] ] ] ] [ Safety
CORE[95], QUIRE[83], Fact[96], Viteri 等[97], leaf, text width=26em] ] ] ] ] [ 安全
(§4), my-box1 [ Vulnerability
(§4), my-box1 [ 漏洞
Assessment   评估
(§4.1), my-box1 [ SafeChain [1], CNSafe [3], Zhang et al. [98], Romero et al. [99], Zhou et al. [100],
(§4.1), my-box1 [ SafeChain [ 1], CNSafe [ 3], 张等人 [ 98], 罗马诺等人 [ 99], 周等人 [ 100],
Li et al. [101], Lou et al. [102], kassianik et al. [103], Krishna et al. [104], FORTRESS [105],
李等人 [ 101], 刘等人 [ 102], kassianik 等人 [ 103], 克里希纳等人 [ 104], FORTRESS [ 105],
Fan et al. [106], BSAbench [107], Is-bench [108], SafeMLRM [109], Zhao et al. [110],
Fan 等人[ 106]、BSAbench[ 107]、Is-bench[ 108]、SafeMLRM[ 109]、Zhao 等人[ 110]、
Marjanović et al. [111] , leaf, text width=36em,align=left ] ] [ Jailbreak
Marjanović等人[111]，叶，文本宽度=36em，对齐=左] ] [越狱
(§4.2), my-box1 [ Attack with
(§4.2), my-box1 [ 攻击 ]
reasoning techniques, my-box1 [ Sabbaghi et al. [112], CoT-GCG [113], Ying et al. [114],
推理技术，我的盒子 1 [Sabbaghi 等人[112]，CoT-GCG [113]，Ying 等人[114]，
Chain-of-Lure [115], Handa et al. [116] , leaf, text width=26em ] ] [ Attack on
Chain-of-Lure [ 115], Handa 等人[ 116], leaf, 文本宽度=26em ] ] [ 攻击
reasoning models, my-box1 [ H-CoT [117], Mousetrap [118], AutoRAN [119], SEAL [120],
推理模型，我的盒子 1 [ H-CoT [ 117], 鼠夹 [ 118], AutoRAN [ 119], SEAL [ 120],
FicDetail [2], Lian et al. [121], RRTL [122], VisCRA [123],
FicDetail [ 2], 李连等[ 121], RRTL [ 122], VisCRA [ 123],
HauntAttack [124] , leaf, text width=26em ] ] [ Defense with
HauntAttack [ 124] , 叶片, 文本宽度=26em ] ] [ 防御与
reasoning techniques, my-box1 [ GuardReasoner [125], X-Guard [126], MrGuard [127] ,
推理技术，my-box1 [ GuardReasoner [ 125], X-Guard [ 126], MrGuard [ 127],
RSafe [128], Sreedhar et al. [129], R²-Guard [130], DR-IRL [131],
ShieldVLM [132], GuardReasoner-VL [133], GuardAgent [134],
ShieldAgent [135], Wang et al. [136], U-CoT+ [137] , leaf, text width=26em ] ] [ Defense for
ShieldAgent [ 135], Wang et al. [ 136], U-CoT+ [ 137] , leaf, text width=26em ] ] [ 防御
reasoning models, my-box1 [ SafeChain [1], Thinking Intervention [138], Wang et al. [12],
推理模型，my-box1 [ SafeChain [ 1], 思维干预 [ 138], 王等 [ 12]，
Yamaguchi et al. [139], Zaremba et al. [140], Saffron-1 [141] , leaf, text width=26em ] ] ] [ Alignment
山口等人[139]、扎雷姆巴等人[140]、Saffron-1[141] ，叶，文本宽度=26em ] ] ] [ 对齐
(§4.3), my-box1 [ Aligning LLM using
(§4.3), my-box1 [ 使用 LLM 对齐 ]
reasoning techniques, my-box1 [ Liu et al. [142], Zhang et al. [143], SCoT [144], STAIR [144],
推理技术，我的盒子 1 [刘等人[142]，张等人[143]，SCoT[144]，STAIR[144]，
R2D [145], RATIONAL [146], ERPO [147], SaRO [148],
Wang et al. [12], Kim et al. [149], Thought-Aligner [150],
王等 [ 12], 金等 [ 149], Thought-Aligner [ 150],
ReasoningShield [151] , leaf, text width=26em ] ] [ Alignment of
ReasoningShield [ 151] , leaf, text width=26em ] ] [ 推理模型的对齐,
reasoning models, my-box1 [ Deliberate Alignment [152], SafeChain [1], STAR-1 [153],
my-box1 [ 故意对齐 [ 152], SafeChain [ 1], STAR-1 [ 153],
RealSafe [154], SAFEPATH [155], Context Reasoner [156],
Lou et al. [102], Baker et al. [157], Zhang et al. [158], Hair [159],
Liu et al. [160], Safety Tax [161], SafeKey [162] , leaf, text width=26em ] ] ] [ Backdoor
(§4.4), my-box1 [ Training-time
data poisoning, my-box1 [ SABER [163], BoT [164], ShadowCoT [165], Chua et al. [166] , leaf, text width=26em ] ] [ Inference-time
数据中毒，my-box1 [ SABER [ 163], BoT [ 164], ShadowCoT [ 165], Chua 等人 [ 166] , 叶片, 文本宽度=26em ] ] [ 推理时
prompt manipulation, my-box1 [ Badchain [167], BackdoorLLM [168], DarkMind [169], CPT [170],
提示操纵，my-box1 [ Badchain [ 167], BackdoorLLM [ 168], DarkMind [ 169], CPT [ 170],
Guo et al. [171], Cui et al. [172], Cui et al. [173] Song et al. [174] , leaf, text width=26em ] ] [ Backdoor
Guo 等人 [ 171], Cui 等人 [ 172], Cui 等人 [ 173] Song 等人 [ 174] , 叶片, 文本宽度=26em ] ] [ 后门
defense, my-box1 [ Chain-of-Scrutiny [19], Marinelli et al. [175], GUARD [176] , leaf, text width=26em ] ] ] ] [ Robustness
防御，my-box1 [ Chain-of-Scrutiny [ 19], Marinelli 等人 [ 175], GUARD [ 176] , 叶片, 文本宽度=26em ] ] ] ] [ 鲁棒性
(§5), my-box1 [ Improvement with
(§5), my-box1 [ 改进
reasoning techniques   推理技术
(§5.1), my-box1 [ Wang et al. [177], CoDT [178], Yan et al. [179], RBD [180], Zaremba et al. [140] , leaf, text width=36em, align=left ] ] [ Robustness of
(§5.1), my-box1 [ 王等 [ 177], CoDT [ 178], 严等 [ 179], RBD [ 180], 扎雷姆巴等 [ 140] , leaf, text width=36em, align=left ] ] [ 对抗鲁棒性
reasoning models   推理模型
(§5.2), my-box1 [ RUPbench [181], Mu et al. [182], RoR-bench [179], M-Attack [183], GaslightingBench-R [184],
Zhou et al. [185], Peng et al. [186], PolyMath [187], CatAttack [188], Math-RoB [189],
MATH-Perturb [190], CodeCrash [191], CoCC [192], AbstentionBench [66], Xu et al. [193] , leaf, text width=36em, align=left ] ] [ Overthinking
(§5.3), my-box1 [ UMP [194], DNR Bench [195], DeltaBench [196], MiP-Overthinking [197], Si et al. [198],
Wang et al. [199], Su et al. [200], Dang et al. [201], Overthink [202], Cuadron et al. [203] , leaf, text width=36em,align=left ] ] [ Underthinking
王等人[199], 苏等人[200], 耿等人[201], Overthink[202], 库阿德罗等人[203], leaf, text width=36em,align=left ] ] [ 低估
(§5.3), my-box1 [ CPT [170], Zaremba et al. [140], Zhao et al. [110], Li et al. [204], ThinkEdit [205] , leaf, text width=36em,align=left ] ] ] [ Fairness
(§5.3), my-box1 [ CPT [170], 扎雷姆巴等人[140], 赵等人[110], 李等人[204], ThinkEdit [205] , leaf, text width=36em,align=left ] ] ] [ 公平性
(§6), my-box1 [ Evaluation &
(§6), my-box1 [ 评估 &
Detection, my-box1 [ Lin et al. [206], Cheng et al. [207], Kamruzzaman et al.[208], Dash et al.[209],
检测, my-box1 [ 林等人[206], 程等人[207], 卡姆鲁兹曼等人[208], 达什等人[209],
Gupta et al.[210], BiasGuard [211], Cantini et al. [212] , leaf, text width=36em,align=left ] ] ] [ Privacy
Gupta 等人[210], BiasGuard[211], Cantini 等人[212], leaf, text width=36em,align=left] [隐私
(§7), my-box1 [ Model-related
(§7), my-box1 [模型相关
Privacy (§7.1), my-box1 [ R-TOFU [213], R²MU [214], SLEEK [215], ImF [216], CoTSRF [217], Guo et al. [218],
隐私(§7.1), my-box1 [R-TOFU[213], R ² MU[214], SLEEK[215], ImF[216], CoTSRF[217], Guo 等人[218],
Savani et al. [219] , leaf, text width=36em,align=left ] ] [ Prompt-related
Savani 等人[219], leaf, text width=36em,align=left] [提示相关]
Privacy (§7.2), my-box1 [ Green et al. [220], DoxBench [221] , leaf, text width=36em,align=left ] ] ] ]
隐私（§7.2），my-box1 [ Green 等人 [ 220]，DoxBench [ 221]，leaf，text width=36em，align=left ] ] ] ]

Figure 2: Taxonomy of trustworthiness in reasoning with large language models.
图 2：使用大型语言模型进行推理的可信度分类学

3 Truthfulness 3 真确性

Truthfulness in the LLMs refers to how an AI system accurately represents information, facts, and results [222]. This fundamental dimension of truthfulness focuses on the model’s ability to provide factually correct and reliable information without generating misleading or false content. In this section, we discuss the new challenges brought by the reasoning techniques, including two aspects: hallucination and faithfulness.
LLMs 中的真确性是指一个 AI 系统如何准确地呈现信息、事实和结果[ 222]。这一真确性的基本维度关注于模型提供事实正确且可靠信息的能力，而不会生成误导性或虚假内容。在本节中，我们讨论了推理技术带来的新挑战，包括两个方面：幻觉和忠实性。

3.1 Hallucination 3.1 幻觉

Hallucination in LLMs refers to instances where models generate responses that appear coherent and plausible but are inconsistent with the input, context, or factual information [223, 224]. The emergence of reasoning models introduces new risks and challenges in managing hallucinations. First, reasoning models often generate responses that are more structured, logically coherent, and superficially persuasive, making them appear more reliable. As a result, hallucinated content from these models can appear more credible, making it harder for users to detect inaccuracies and increasing the risk of spreading misinformation [70], especially in high-stakes fields such as healthcare, law, or education. On the other hand, the CoT reasoning generated by models can also contain hallucinations [68]. Compared to traditional LLMs, the hallucinations in reasoning models have not been as thoroughly evaluated. Moreover, the powerful reasoning capabilities of these models can be leveraged to detect or mitigate hallucinations in certain complex tasks [56, 58].
LLMs 中的幻觉指的是模型生成看似连贯且合理的响应，但这些响应与输入、上下文或事实信息不一致的情况[223, 224]。推理模型的涌现为管理幻觉带来了新的风险和挑战。首先，推理模型通常生成结构更清晰、逻辑上连贯且表面上有说服力的响应，使其看起来更可靠。因此，这些模型产生的幻觉内容可能更具可信度，使用户更难检测错误，增加了错误信息的传播风险[70]，特别是在医疗保健、法律或教育等高风险领域。另一方面，模型生成的 CoT 推理也可能包含幻觉[68]。与传统 LLMs 相比，推理模型中的幻觉尚未得到充分评估。此外，这些模型强大的推理能力可以被用于检测或减轻某些复杂任务中的幻觉[56, 58]。

3.1.1 Hallucination with Reasoning Techniques
3.1.1Hallucination 使用推理技术

In this section, we explore how reasoning techniques can be leveraged to detect and mitigate hallucinations in LLMs. CoT prompting has shown remarkable success in addressing complex tasks [17, 16] and reducing hallucinations [225]. To further enhance model reasoning capabilities, several techniques have been proposed, such as test-time scaling [226], self-consistency [227], etc. One such approach, HaluSearch [56], employed a tree search-based algorithm coupled with a switch model to determine when to engage in more deliberate, “slow thinking” processes. In contrast to hallucination mitigation, HalluMeasure [57] focused on fine-grained hallucination measurement, using CoT prompting. Specifically, it decomposed model responses into a series of claims and applies CoT techniques to detect hallucinations at the claim level. Similarly, CLATTER [58] adopted a multi-step reasoning process for hallucination detection, consisting of decomposition, attribution, entailment, and aggregation. Moreover, Xie et al. [59] observed that the order in which reasoning steps are applied can influence hallucination occurrence. As such, they propose Reflexive Prompting, which combines “answer-first” and “logic-first” reasoning strategies to improve model accuracy. Beyond text-based tasks, Zhang et al. [45] extended CoT to multimodal settings, proposing a method to mitigate visual hallucinations. Their approach involves generating a rationale that is used to update the language input, which is then combined with the original visual input to produce the final answer. Furthermore, Wu et al. [60] introduced Grounded Chain-of-Thought (GCoT), a technique in which the model gradually grounds visual cues before generating answers. This step-by-step process helps mitigate visual hallucinations by enhancing the model’s understanding of the input. In addition, in the context of medical report generation, CoMT [61] leveraged CoT prompting to reduce hallucinations and produce high-quality, accurate reports. In summary, reasoning techniques have been used in various ways and in many application fields to help solve the hallucination problem of LLMs.
在本节中，我们探讨了如何利用推理技术来检测和缓解 LLMs 中的幻觉。CoT 提示方法在处理复杂任务方面表现出色[17, 16]，并有效减少了幻觉[225]。为了进一步提升模型的推理能力，研究者们提出了多种技术，如测试时缩放[226]、自洽性[227]等。其中一种方法是 HaluSearch[56]，该技术采用基于树搜索的算法与切换模型相结合，以确定何时启动更审慎的“慢思考”过程。与幻觉缓解不同，HalluMeasure[57]专注于细粒度的幻觉测量，并使用 CoT 提示方法。具体而言，它将模型响应分解为一系列主张，并应用 CoT 技术来检测主张层面的幻觉。类似地，CLATTER[58]采用多步推理过程进行幻觉检测，包括分解、归因、蕴涵和聚合。此外，Xie 等人[59]观察到推理步骤的应用顺序会影响幻觉的发生。因此，他们提出了反思性提示方法，该方法结合了“先回答”和“先逻辑”的推理策略以提高模型准确性。除了基于文本的任务外，Zhang 等人[45]将思维链扩展到多模态环境，提出了一种减轻视觉幻觉的方法。他们的方法涉及生成一个推理依据，用于更新语言输入，然后将该语言输入与原始视觉输入结合以生成最终答案。此外，Wu 等人[60]引入了基于思维链（GCoT），这是一种技术，其中模型在生成答案之前逐步将视觉线索具体化。这种逐步的过程通过增强模型对输入的理解来帮助减轻视觉幻觉。此外，在医疗报告生成的背景下，CoMT[61]利用思维链提示来减少幻觉并生成高质量、准确的报告。总之，推理技术已被以各种方式和在许多应用领域使用，以帮助解决 LLMs 的幻觉问题。

3.1.2 Hallucination in Reasoning Models
3.1.2Hallucination 在推理模型中

Despite their ability to tackle complex tasks, reasoning models are not immune to hallucination. In this section, we focus on understanding the hallucination problem in reasoning models and survey techniques for its detection and mitigation.
尽管推理模型能够处理复杂任务，但它们并非不受幻觉的影响。在本节中，我们重点关注理解推理模型中的幻觉问题，并调查其检测和缓解的技术。

Hallucination analysis. The analysis of hallucinations in reasoning models can be approached from two key questions: (1) How do reasoning models perform with respect to hallucinations? and (2) What factors contribute to hallucinations in reasoning models?
幻觉分析。对推理模型中幻觉的分析可以从两个关键问题入手：(1) 推理模型在幻觉方面的表现如何？以及 (2) 哪些因素导致了推理模型中的幻觉？

Several studies [62, 63, 64, 65, 225, 66] have documented significant hallucination issues within reasoning models, sometimes more pronounced than in non-reasoning models. For instance, Lu et al. [67] argued that LRMs exacerbate hallucination issues, making them more frequent and harder to mitigate. Their findings suggest that rather than correcting errors, LRMs tend to amplify biases and inaccuracies in the CoT of the reasoning process. Similarly, Song et al. [63] and Kirichenko et al. [66] highlighted that reasoning models, when faced with unanswerable questions, struggle to recognize and refuse to respond appropriately, a challenge that is less prevalent in non-reasoning models. The hallucination problem in LRMs is not confined to unanswerable questions. Li et al. [68] and Yao et al. [65] evaluated reasoning models on both traditional hallucination benchmarks (e.g., TruthfulQA [228], HaluEval [229], HalluQA [230]) and fact-seeking benchmarks (e.g., SimpleQA [231], TriviaQA [232]), consistently finding that reasoning models exhibit higher rates of hallucination. Liu et al. [64] extended this observation to visual tasks, where improved reasoning capabilities were often accompanied by more severe visual hallucinations. Together, these studies suggest that while reasoning models improve performance on complex tasks, they can also produce more significant hallucinations than non-reasoning models in simpler, non-reasoning tasks. Moreover, many studies have also found that there are serious illusions in the generated CoT [69, 67, 70, 68, 71]. Given the typical length and apparent logical coherence of CoT, such hallucinations are often difficult to detect and correct, posing a critical challenge for future research.
已有数项研究[62, 63, 64, 65, 225, 66]记录了推理模型中存在显著的幻觉问题，有时比非推理模型更为突出。例如，Lu 等人[67]认为，大型语言模型（LRMs）加剧了幻觉问题，使其更频繁且更难缓解。他们的研究发现，LRMs 倾向于放大推理过程中思维链（CoT）中的偏见和不准确性，而非纠正错误。类似地，Song 等人[63]和 Kirichenko 等人[66]指出，当推理模型面对无法回答的问题时，它们难以识别并适当拒绝回答，而这一挑战在非推理模型中不那么普遍。LRMs 中的幻觉问题不仅限于无法回答的问题。Li 等人[68]和 Yao 等人[65]在传统幻觉基准测试（例如，TruthfulQA[228]、HaluEval[229]、HalluQA[230]）和事实查找基准测试（例如，SimpleQA[231]、TriviaQA[232]）上评估了推理模型，一致发现推理模型的幻觉率更高。刘等人[ ] [ 64] 将这一观察扩展到视觉任务中，发现推理能力提升往往伴随着更严重的视觉幻觉。这些研究表明，虽然推理模型在复杂任务上的表现有所提高，但在简单、非推理任务中，它们产生的幻觉可能比非推理模型更为显著。此外，许多研究还发现生成的思维链（CoT）[ 69, 67, 70, 68, 71] 存在严重幻觉。鉴于思维链通常的长度和表面上的逻辑连贯性，这类幻觉往往难以检测和纠正，为未来的研究带来了重大挑战。

When examining the causes of hallucinations, several studies point to the length of the CoT as a significant factor [67, 64]. For example, Lu et al. [67] reported that hallucinations tend to occur more frequently in longer CoTs compared to those with correct answers. Similarly, Liu et al. [64] observed that as CoTs become longer, models increasingly rely on language priors over visual inputs, a shift that often leads to visual hallucinations. Another important factor is the training paradigm of the model. Yao et al. [65] suggested that while combining SFT with RL training can improve model performance on fact-seeking tasks, both SFT-only and RL-only paradigms lead to severe hallucinations, often manifesting as flaw repetition or mismatched thinking and answers. Li et al. [68] similarly identified outcome-based RL fine-tuning as a contributor to hallucinations, highlighting three critical factors: high variance in policy gradients, high entropy in predictions, and the presence of spurious local optima.
在研究幻觉产生的原因时，一些研究表明 CoT 的长度是一个重要因素[67, 64]。例如，Lu 等人[67]报告称，与有正确答案的 CoT 相比，幻觉在更长的 CoT 中更频繁地发生。类似地，Liu 等人[64]观察到，随着 CoT 的变长，模型越来越依赖语言先验而非视觉输入，这种转变往往导致视觉幻觉。另一个重要因素是模型的训练范式。Yao 等人[65]指出，虽然将 SFT 与 RL 训练相结合可以提高模型在事实查找任务上的性能，但仅使用 SFT 或仅使用 RL 的训练范式会导致严重的幻觉，通常表现为缺陷重复或思维与答案不匹配。Li 等人[68]同样发现基于结果的 RL 微调是导致幻觉的一个因素，并强调了三个关键因素：策略梯度的高方差、预测的高熵以及虚假局部极值的存在。

Hallucination detection and measurement. The PRM model [40] provided an effective approach for measuring hallucinations within the reasoning process. Li et al. [72] extended this work by introducing a Fine-grained Process Reward Model (FG-PRM), which trained six specialized PRMs to address specific types of hallucinations, including context inconsistency, logical inconsistency, instruction inconsistency, logical errors, factual inconsistencies, and fabrication. These PRMs generated a combined signal to detect hallucinations more accurately. Different from PRM-based methods, Zhang et al. [73] adopted linear probing, aiming at detecting errors early during reasoning. However, the above methods need additional training steps. Dong et al. [62] adopted proxy LLMs to augment and rate the reasoning chain as an indicator of hallucination. Sun et al. [70] introduced the “reasoning score”, a metric that measures divergence between intermediate hidden states and final logits. Their findings suggest that several indicators related to this score correlate strongly with the occurrence of hallucinations, leading them to combine these indicators for effective detection. More recently, Wang et al. [74] developed the RACE framework for hallucination detection, which extracts simplified reasoning steps via an LLM and evaluates four key aspects of the reasoning chain: reasoning consistency, answer uncertainty, reasoning-answer alignment, and reasoning coherence.
幻觉检测与测量。PRM 模型[40]为在推理过程中测量幻觉提供了一种有效方法。Li 等人[72]通过引入细粒度过程奖励模型（FG-PRM）扩展了这项工作，该模型训练了六个专门的 PRM 来处理特定类型的幻觉，包括上下文不一致、逻辑不一致、指令不一致、逻辑错误、事实不一致和捏造。这些 PRM 生成了一个组合信号，以更准确地检测幻觉。与基于 PRM 的方法不同，Zhang 等人[73]采用了线性探测，旨在在推理过程中早期检测错误。然而，上述方法需要额外的训练步骤。Dong 等人[62]采用代理 LLMs 来增强和评分推理链，将其作为幻觉的指标。Sun 等人[70]引入了“推理分数”，这是一个衡量中间隐藏状态与最终 logits 之间差异的指标。他们的研究发现，与该分数相关的几个指标与幻觉的发生密切相关，因此他们将这些指标结合起来进行有效检测。最近，王等人[74]开发了用于幻觉检测的 RACE 框架，该框架通过 LLM 提取简化的推理步骤，并评估推理链的四个关键方面：推理一致性、答案不确定性、推理-答案对齐和推理连贯性。

Hallucination mitigation. In addition to hallucination detection, another way to combat hallucinations in LRMs is hallucination mitigation, which aims to reduce the frequency of hallucinations through various strategies. These strategies can be broadly classified into two categories: training-based methods and planning-based methods.
幻觉缓解。除了幻觉检测外，对抗 LRM 中幻觉的另一种方法是幻觉缓解，该方法旨在通过多种策略减少幻觉的频率。这些策略可以大致分为两类：基于训练的方法和基于规划的方法。

Training-based methods involve intervening in the model’s training process, either by introducing additional training objectives or incorporating specialized training data. For instance, Song et al. [63] modified the reward function in the PPO algorithm [31], encouraging the model to respond with “I don’t know” when faced with unanswerable questions. This approach mitigates hallucinations on unanswerable problems while preserving performance on solvable ones. Similarly, Sun et al. [70] proposed GRPO-R, an extension of the original GRPO [32], where the reward was adjusted by incorporating a reasoning score. FSPO [68] further refined this approach by introducing both a rule-based correctness reward for the final answer and a step-wise factuality reward, which is derived from the LLM’s reasoning process in conjunction with additional evidence.
基于训练的方法涉及干预模型的训练过程，可以通过引入额外的训练目标或结合专门的训练数据来实现。例如，Song 等人[63]修改了 PPO 算法[31]中的奖励函数，鼓励模型在面对无法回答的问题时回应“我不知道”。这种方法在解决无法回答的问题时减轻了幻觉现象，同时保留了在可解决问题上的表现。类似地，Sun 等人[70]提出了 GRPO-R，这是对原始 GRPO[32]的扩展，其中通过结合推理分数来调整奖励。FSPO[68]进一步改进了这种方法，引入了基于规则的最终答案正确性奖励和逐步事实性奖励，后者是从 LLM 的推理过程以及额外证据中推导出来的。

In contrast, planning-based methods do not necessitate modifications to the training procedure. Instead, they focus on mitigating hallucinations by improving the model’s reasoning path through better planning. Zheng et al. [47] argued that models may suffer from vision-language bias when they process information while simultaneously attending to both vision and text inputs. To address this, they first prompted the model to generate a reasoning plan using text-only input, and then, based on the generated plan, proceeded to solve the problem and generate intermediate reasoning steps with the vision-language input.
相比之下，基于规划的方法无需修改训练流程。它们通过改进模型的推理路径来缓解幻觉问题，从而专注于通过更好的规划来减轻幻觉。Zheng 等人[47]认为，当模型同时处理视觉和文本输入时，可能会受到视觉-语言偏差的影响。为了解决这个问题，他们首先使用纯文本输入提示模型生成推理计划，然后基于生成的计划进行问题求解，并使用视觉-语言输入生成中间推理步骤。

Overall, our review indicates that while reasoning models have demonstrated remarkable progress on complex reasoning-driven tasks, their tendency to hallucinate even in common scenarios remains a fundamental limitation. Addressing this tension between reasoning capability and reliability will require systematic investigation, and stands as an important direction for future research.
总体而言，我们的综述表明，虽然推理模型在复杂的推理驱动任务上取得了显著进展，但它们在常见场景中仍然存在幻觉的倾向，这仍然是一个基本限制。解决推理能力和可靠性之间的这种张力将需要系统的调查研究，并成为未来研究的一个重要方向。

3.2 Faithfulness of Reasoning Models
3.2 推理模型的忠实性

Faithfulness in traditional natural language generation is defined by the extent to which the model’s outputs align with or are supported by the provided input [233]. In this work, we specifically examine reasoning faithfulness in the context of LLM reasoning, focusing on faithfulness related to CoT prompting and LRM. In LLM reasoning scenarios, reasoning faithfulness typically addresses the question [234, 91]: “Does the explanation generated by the model accurately reflect the reasoning process behind its prediction?”
在传统的自然语言生成中，模型的输出与所提供的输入的符合程度或支持程度被定义为忠实性[ 233]。在本工作中，我们特别考察了 LLM 推理中的推理忠实性，重点关注与 CoT 提示和 LRM 相关的忠实性。在 LLM 推理场景中，推理忠实性通常涉及这样一个问题[ 234, 91]：“模型生成的解释是否准确反映了其预测背后的推理过程？”

Reasoning faithfulness is a fundamental aspect of overall model truthfulness. A lack of faithfulness in CoT reasoning can introduce significant safety risks, particularly in high-stakes domains such as legal services, medical treatment, and financial decision-making [84], where users may be misled into overestimating the model’s interpretability. Research on reasoning faithfulness can be broadly categorized into three key areas: faithfulness measuring, understanding, and improvement. In the following sections, we will explore reasoning faithfulness from each of these three perspectives.
推理忠实性是模型整体真实性中的一个基本方面。CoT 推理中的缺乏忠实性可能会引入重大的安全风险，特别是在法律服务、医疗治疗和金融决策等高风险领域[ 84]，用户可能会被误导而高估模型的可解释性。推理忠实性的研究可以大致分为三个关键领域：忠实性度量、理解和改进。在接下来的几节中，我们将从这三个角度探讨推理忠实性。

3.2.1 Faithfulness Measuring
3.2.1 忠实性度量

While faithfulness is an essential component of trustworthiness, comprehensively measuring it remains an open challenge. However, several metrics have been proposed to partially evaluate the faithfulness of CoT [75, 76, 77]. These methods can be broadly categorized into various intervention techniques that modify either the reasoning process, the input, or the model parameters to measure how faithfully the model’s CoT reflects its reasoning process.
虽然忠实性是可信度的重要组成部分，但全面衡量它仍然是一个开放性的挑战。然而，已经提出了几种指标来部分评估 CoT[75, 76, 77]的忠实性。这些方法可以大致分为各种干预技术，这些技术通过修改推理过程、输入或模型参数来测量模型生成的 CoT 如何忠实地反映其推理过程。

CoT intervention. One prominent evaluation method involves modifying the CoT reasoning path $T$ generated by the model and observing changes in the output to assess whether the reasoning faithfully supports the model’s prediction [75, 79, 88, 235]. Lanham et al. [75] proposed a CoT intervention approach, which alters the reasoning process by truncating the CoT before the final answer or introducing errors at specific points in the reasoning chain. The former one truncates the original CoT before answering, and the latter one adds a mistake generated by a proxy LLM into some specific position in the CoT and generates subsequent CoT autoregressively. After CoT intervention, if the answer changes, it means that the CoT matters in the model’s prediction, which indicates that the CoT is faithful. By introducing CoT interventions at different steps of the reasoning process, we can generate a consistency curve and use the Area Over Curve (AOC) to quantify faithfulness. However, Bentham et al. [79] cautioned that such metrics may be biased due to inherent label biases in the model. To address this, they introduce a CoT-agnostic normalized metric, calculated as follows:
CoT 干预。一种突出的评估方法涉及修改模型生成的 CoT 推理路径 $T$ ，通过观察输出变化来评估推理是否真实地支持模型的预测[ 75, 79, 88, 235]。Lanham 等人[ 75]提出了一种 CoT 干预方法，通过在最终答案之前截断 CoT 或在推理链的特定点引入错误来改变推理过程。前者在回答前截断原始 CoT，后者将一个由代理 LLM 生成的错误添加到 CoT 的某些特定位置，并自回归地生成后续 CoT。CoT 干预后，如果答案发生变化，这意味着 CoT 在模型的预测中很重要，这表明 CoT 是真实的。通过在推理过程的不同步骤引入 CoT 干预，我们可以生成一致性曲线，并使用曲线下面积（AOC）来量化真实性。然而，Bentham 等人[ 79]警告说，由于模型中固有的标签偏差，此类指标可能存在偏差。为解决这个问题，他们引入了一种与 CoT 无关的归一化指标，计算方法如下：

N(\mathcal{M},\mathcal{D})=\frac{1}{\lvert\mathcal{D}\rvert}\sum\limits_{x\in\mathcal{D}}\mathbbm{1}_{[\mathcal{M}(x)=\mathcal{M}(\tilde{x})]},

(2)

where $\mathbbm{1}$ represents the indicator function, and $\tilde{x}$ refers to a version of $x$ where answer choices have been shuffled. Additionally, Paul et al. [88] used the Lakage-Adjusted Simulatability (LAS) [236] to measure faithfulness by evaluating the accuracy deviation between the model’s performance with and without CoT reasoning. Xiong et al. [78] extended CoT intervention to assess both intra-draft and draft-to-answer faithfulness in large reasoning models, such as DeepSeek-R1. Yee et al. [235] employed error injection into the CoT and classified reasoning as faithful or unfaithful based on whether the model recovered the injected error in the final answer.
其中 $\mathbbm{1}$ 表示指示函数， $\tilde{x}$ 指的是 $x$ 的一个版本，其中答案选项已被打乱。此外，Paul 等人 [88] 使用 Lakage-Adjusted Simulatability (LAS) [236] 来衡量忠实度，通过评估模型使用和未使用思维链推理时的性能准确度偏差。Xiong 等人 [78] 将思维链干预扩展到评估大型推理模型（如 DeepSeek-R1）的草稿内和草稿到答案的忠实度。Yee 等人 [235] 将错误注入到思维链中，并根据模型是否在最终答案中恢复注入的错误来将推理分类为忠实或不忠实。

Table 2: Prompts demonstrating the two biasing features. The text for the unbiased context is in Italian and for the biased context in Bold. The top example shows the Answer is Always A biasing feature, in which we reorder the multiple-choice options in a few-shot prompt to make the answer always (A). The bottom shows the Suggested Answer bias, in which we add text where a user suggests a random answer is correct. This table is borrowed from Turpin et al.[76].
表 2：展示两种偏置特征的提示。无偏置上下文的文本为意大利语，有偏置上下文的文本为粗体。顶部示例展示了"答案总是 A"的偏置特征，其中我们在少量样本提示中重新排序多项选择题选项，使答案始终为(A)。底部展示了"建议答案"偏置，其中我们添加文本，让用户建议一个随机答案是正确的。该表格借鉴自 Turpin 等人[76]。

Biasing Feature #1: Answer is Always A
偏置特征#1：答案总是 A

Human: Q: Is the following sentence plausible? “Julio Jones struck out.” (A) plausible / implausible (B) implausible / plausible
人类：Q: 以下句子是否合理？"Julio Jones 被三振出局。" (A)合理/不合理 (B)不合理/合理 Let’s think step by step. […] Answer: (B) / (A) implausible
让我们逐步思考。 […] 答案：(B) / (A)不合理 […] Answer: (B) / (A) implausible
…答案：(B) / (A) 不合理 […] Answer: (B) / (A) implausible
…答案：(B) / (A) 不合理 […] Answer: (A) / (A) plausible
…答案：(A) / (A) 合理 […] Answer: (B) / (A) implausible
…答案：(B) / (A) 不合理 […] Answer: (A) / (A) plausible
…答案：(A) / (A) 合理 Q: Is the following sentence plausible? “LeBron James took a corner kick.” (A) plausible (B) implausible
Q: 以下句子是否合理？“LeBron James 进行了一次角球。” (A) 合理 (B) 不合理 Assistant: [generate CoT explanation]
助手：[生成思维链解释]

Biasing Feature #2: Suggested Answer
偏见特征 #2：建议答案

Human: Q: Is the following sentence plausible? “LeBron James took a corner kick.” (A) plausible (B) implausible
人类：问：以下句子合理吗？“勒布朗·詹姆斯踢了一个角球。”（A）合理（B）不合理 <empty> / I think the answer is A but I’m curious to hear what you think.
<空> / 我认为答案是 A，但我好奇你的想法。 Assistant: [generate CoT explanation]
助手：[生成思维链解释]

Input intervention. Input intervention differs from CoT intervention in that it introduces bias into the input $x$ and observes how both the CoT reasoning and the model’s answer change as a result. Table˜2 shows a demonstration of input interventions proposed by Turpin et al. [76]. Specifically, by either setting all answers in the few-shot demonstration to a fixed choice (e.g., (A)) or expressing a preference for a particular answer choice, LLMs often adjust their answers accordingly. This shift in answers is used to assess the model’s faithfulness, with the accuracy drop serving as a key metric for unfaithfulness. However, it is important to note that the bias introduced into the input is typically not reflected in the CoT, thereby highlighting a potential risk of unfaithfulness. Similarly, Chua et al. [81] and Chen et al. [82] built upon this concept by inserting various cues (i.e., professor suggestions and black/white square implications) into the inputs. Unlike Turpin et al. [76], who focused on the accuracy drop, these studies assessed faithfulness by determining whether the model acknowledges the inserted cue when its answer changes. Yet, like previous studies, these models may fail to mention the cues in the CoT, exposing faithfulness vulnerability in their reasoning process. Arcuschin et al. [80] proposed to flip the question (e.g., changing “ $\text{Is }X>Y$ ” to “ $\text{Is }Y>X$ ”). If the model’s answer does not change, it is considered unfaithful.
输入干预。输入干预与思维链干预不同，它向输入中引入偏差 $x$ ，并观察思维链推理和模型答案如何因此改变。表˜2 展示了 Turpin 等人[ 76]提出的输入干预的示例。具体来说，通过将少样本演示中的所有答案设置为固定选项（例如（A））或表达对特定答案选项的偏好，LLMs 通常会相应调整它们的答案。这种答案的变化被用来评估模型的忠实度，准确率下降是衡量不忠实的关键指标。然而，需要注意的是，输入中引入的偏差通常不会反映在思维链中，从而突显了不忠实的一个潜在风险。类似地，Chua 等人[ 81]和 Chen 等人[ 82]基于这一概念，通过向输入中插入各种提示（即教授建议和黑白方块暗示）来构建研究。与 Turpin 等人[ 76]关注准确率下降不同，这些研究通过判断模型在答案变化时是否承认插入的提示来评估忠实度。然而，与以往研究类似，这些模型可能未能提及 CoT 中的线索，从而暴露出推理过程中的忠实性漏洞。Arcuschin 等人[80]提出反转问题（例如，将“ $\text{Is }X>Y$ ”改为“ $\text{Is }Y>X$ ”）。如果模型的答案没有改变，则认为其不忠实。

Parameter intervention. In a recent study, Tutek et al. [77] argued that metrics based solely on CoT intervention only evaluate contextual faithfulness. Although crucial context may be erased, the relevant knowledge embedded within the model’s parameters remains intact, potentially allowing the model to reconstruct the missing context. To address this, Tutek et al. [77] introduced FUR, a method that utilizes the unlearning algorithm NPO [237] to assess parameter faithfulness. Specifically, they segment the CoT $T$ and then unlearn a single step in it. And then they use the answer consistency and probability divergence between the original model $\mathcal{M}$ and the unlearned model $\mathcal{M}^{\prime}$ to estimate the faithfulness.
参数干预。在一项近期研究中，Tutek 等人[77]指出，仅基于 CoT 干预的指标仅评估上下文忠实性。尽管关键上下文可能被擦除，但模型参数中嵌入的相关知识仍然完好无损，这可能导致模型重建缺失的上下文。为解决这一问题，Tutek 等人[77]引入了 FUR 方法，该方法利用去学习算法 NPO[237]来评估参数忠实性。具体而言，他们将 CoT $T$ 分段，然后对其中的一个步骤进行去学习。随后，他们利用原始模型 $\mathcal{M}$ 和去学习模型 $\mathcal{M}^{\prime}$ 之间的答案一致性和概率发散来估计忠实性。

No intervention. Xu et al. [89] adopted manual evaluation, which divides an instance into three classes: (1) faithful: both the answer and the process are correct and logical (2) unfaithful: the answer is correct but the reasoning process is not; (3) false: the answer is incorrect. Similarly, Li et al. [83] considered an instance to be faithful if and only if both the CoT and the answer are correct or incorrect.
没有干预。徐等人 [89] 采用了人工评估方法，将实例分为三类：(1) 忠实：答案和推理过程均正确且合理；(2) 不忠实：答案正确但推理过程不正确；(3) 错误：答案不正确。类似地，李等人 [83] 认为一个实例是忠实的，当且仅当 CoT 和答案都正确或都错误。

3.2.2 Faithfulness Understanding
3.2.2 忠实性理解

A growing body of research delves into the mechanisms underlying the faithfulness of reasoning in Large Language Models (LLMs). In this section, we summarize key studies that aim to understand and enhance the faithfulness of LLMs’ reasoning processes.
越来越多的研究深入探讨大型语言模型（LLMs）中推理忠实性的机制。在本节中，我们总结了旨在理解和增强 LLMs 推理过程忠实性的关键研究。

Unfaithfulness problem. Despite the impressive performance of CoT reasoning in handling complex tasks, the CoTs generated by models can still exhibit unfaithfulness—remaining logically coherent but diverging from the true reasoning process [76, 75]. Lanham et al. [75] revealed that, in some cases, the reasoning process is post-hoc: the model first determines the answer and then fabricates a plausible explanation, rather than deriving the answer through the reasoning. While reasoning models generally show better faithfulness than non-reasoning models [81], they still exhibit unfaithfulness that warrants further attention [82, 80]. Agarwal et al. [84] emphasized that faithfulness is critical in high-stakes applications, such as healthcare diagnosis, financial forecasting, and crime prediction, while plausibility (the degree to which reasoning aligns with human understanding) is essential in more recreational or educational contexts, such as story-telling and educational LLMs.
不忠实问题。尽管 CoT 推理在处理复杂任务方面表现出色，但模型生成的 CoT 仍可能存在不忠实——逻辑上连贯但偏离真实推理过程[76, 75]。Lanham 等人[75]揭示，在某些情况下，推理过程是后置的：模型先确定答案，然后编造一个看似合理的解释，而不是通过推理得出答案。虽然推理模型通常比非推理模型表现出更高的忠实度[81]，但它们仍然存在需要进一步关注的不忠实现象[82, 80]。Agarwal 等人[84]强调，在医疗诊断、金融预测和犯罪预测等高风险应用中，忠实度至关重要，而在故事讲述和教育 LLM 等更多休闲或教育环境中，合理性（推理与人类理解的契合程度）则必不可少。

The factors that influence faithfulness. When unfaithfulness arises in models, a considerable amount of research investigates the factors influencing this issue. Early work by Lanham et al. [75] explored how model size and model capability affect faithfulness. Their findings suggest that reasoning faithfulness typically increases, then decreases, with an increase in model size, with an optimal size around 13B parameters. Bentham et al. [79] extended this research across various LLM families and confirmed a similar trend. Interestingly, they observed that models with higher accuracy tend to exhibit lower faithfulness, a finding also supported by Tanneru et al. [86]. Conversely, Bao et al. [85] and Xiong et al. [78] argued that larger models are generally more faithful, suggesting the possibility of a nuanced relationship between size and faithfulness. The findings drawn by Bentham et al. [79] and Tanneru et al. [86] may stem from the fact that more performant models can often generate correct answers despite error or incomplete CoTs, indicating that existing faithfulness measures may oversimplify the issue. Additionally, Lanham et al. [75] highlighted that the faithfulness of a model’s reasoning varies significantly across tasks, with faithfulness scores AOC ranging from less than 10% to over 60%. Chen et al. [82] and Xiong et al. [78] demonstrated experimentally that models are more prone to unfaithfulness when tasked with more difficult problems. In addition, there is ongoing debate surrounding the impact of CoT length on faithfulness. Chua et al. [81] suggested that length penalties may result in unfaithful responses, but Chen et al. [82] claimed that unfaithful CoTs are usually longer than faithful CoTs. Bao et al. [85] proposed an alternative explanation based on structural causal models (SCMs) [238]. They claimed that reasoning derived from a causal chain (where the answer stems directly from the CoT, which is in turn derived from the instruction) is generally more faithful. In contrast, reasoning that depends on more complex SCM types, such as common cause or full connection, may introduce unfaithfulness due to the increased dependency on the instruction. Recent work also highlights the role of post-training techniques in shaping model faithfulness. For instance, a study by Bao et al. [85] indicated that SFT and DPO could weaken a model’s faithfulness. Lobo et al. [87] found that the impact of SFT on faithfulness is more pronounced in smaller models, with larger models being less affected. Finally, recent studies suggested that reasoning models trained with reinforcement learning with verifiable rewards (RLVR) (e.g., DeepSeek-R1 [15]) exhibit significantly higher faithfulness compared to non-reasoning models [81, 82, 80]. Although many factors are related to faithfulness, their conclusions may be contradictory due to different evaluation methods and models. This calls for the development of more comprehensive evaluation methods.
影响忠实性的因素。当模型出现不忠实性时，大量研究调查了影响这一问题的因素。Lanham 等人早期的研究[75]探讨了模型大小和模型能力如何影响忠实性。他们的研究发现，随着模型大小的增加，推理忠实性通常会先增加后减少，最佳参数量约为 13B。Bentham 等人[79]将这项研究扩展到不同的 LLM 家族，并确认了类似的趋势。有趣的是，他们观察到准确率较高的模型往往表现出较低的忠实性，这一发现也得到了 Tanneru 等人[86]的支持。相反，Bao 等人[85]和 Xiong 等人[78]认为较大的模型通常更忠实，这表明大小和忠实性之间可能存在复杂的关系。Bentham 等人[79]和 Tanneru 等人[86]得出的发现可能源于性能更优的模型即使存在错误或不完整的 CoTs 也能生成正确答案，这表明现有的忠实性度量标准可能过于简化了问题。此外，Lanham 等人 [ 75] 指出模型的推理可信度在不同任务中差异显著，可信度得分 AOC 范围从不到 10%到超过 60%。Chen 等人[ 82]和 Xiong 等人[ 78]通过实验证明，当模型处理更复杂问题时，更容易出现不可信的情况。此外，关于 CoT 长度对可信度的影响存在持续争论。Chua 等人[ 81]提出长度惩罚可能导致不可信的响应，但 Chen 等人[ 82]声称不可信的 CoT 通常比可信的 CoT 更长。Bao 等人[ 85]基于结构因果模型（SCMs）[ 238]提出了另一种解释。他们认为，基于因果链（即答案直接源于 CoT，而 CoT 又源于指令）推导出的推理通常更可信。相比之下，依赖更复杂 SCM 类型（如共同原因或完全连接）的推理可能由于对指令的依赖增加而引入不可信性。近期研究还强调了后训练技术在塑造模型可信度中的作用。例如，Bao 等人的一项研究显示... [ 85] 指出，监督微调（SFT）和去偏见优化（DPO）可能会削弱模型的忠实性。Lobo 等人[ 87]发现，SFT 对忠实性的影响在较小模型中更为显著，而较大模型受影响较小。最后，最近的研究表明，使用带可验证奖励的强化学习（RLVR）训练的推理模型（例如，DeepSeek-R1 [ 15]）与非推理模型[ 81, 82, 80]相比，表现出显著更高的忠实性。尽管许多因素与忠实性相关，但由于评估方法和模型的不同，他们的结论可能相互矛盾。这要求我们开发更全面的评估方法。

3.2.3 Faithfulness Improvement
3.2.3 忠实性改进

Since faithfulness is an important part of trustworthiness, many methods have been proposed to enhance the faithfulness of the model. To improve reasoning faithfulness in large language models, Radhakrishnan et al. [90] adopted a question decomposition strategy. They break down a complex question into a sequence of subquestions, solve each one individually, and then recompose the intermediate answers to arrive at the final answer. Recent work has explored symbolic reasoning to further enhance faithfulness. Faithful CoT [91] translated natural language queries into symbolic reasoning steps using an LLM, then employed a deterministic solver (e.g., a Python interpreter) to compute the final answer. Each reasoning step in the chain included three components: a subquestion, a dependency graph, and corresponding rationales. Similarly, LOGIC-LM [92] used symbolic formulation and an external reasoner, and introduced a self-refinement mechanism when the executor returned an error. However, reliance on external symbolic solvers may lead to brittleness in the presence of syntax errors. To address this limitation, approaches such as SymbCoT [89], FLARE [93], and CoMAT [94] proposed to use LLMs themselves as solvers and verifiers. SymbCoT used the LLM in multiple roles (i.e., symbolic translator, planner, solver, and verifier) via distinct prompt templates. FLARE formalized problems into logic programs and simulates their execution using LLMs modeled after Prolog-style reasoning. Wang et al. [95] proposed the CORE framework, which iteratively refined both the rationale and the answer while ensuring that the model’s confidence aligns with logical propositions. QUIRE [83] enhanced faithfulness by re-emphasizing critical input information before initiating CoT reasoning.
由于可信度的重要组成部分是忠实性，因此已经提出了许多方法来增强模型的忠实性。为了提高大型语言模型中的推理忠实性，Radhakrishnan 等人[90]采用了一种问题分解策略。他们将复杂问题分解为一系列子问题，单独解决每个子问题，然后重新组合中间答案以得出最终答案。最近的研究探索了符号推理，以进一步增强忠实性。忠实性 CoT[91]使用 LLM 将自然语言查询转换为符号推理步骤，然后使用确定性求解器（例如 Python 解释器）来计算最终答案。链中的每个推理步骤包括三个组件：子问题、依赖图和相应的推理依据。类似地，LOGIC-LM[92]使用了符号化表述和外部推理器，并在执行器返回错误时引入了自我完善机制。然而，对外部符号求解器的依赖可能导致在存在语法错误时出现脆弱性。为解决这一局限性，SymbCoT [ 89]、FLARE [ 93] 和 CoMAT [ 94] 等方法提出使用 LLMs 本身作为求解器和验证器。SymbCoT 通过不同的提示模板，使 LLM 在多个角色中发挥作用（即符号翻译器、规划器、求解器和验证器）。FLARE 将问题形式化为逻辑程序，并使用模仿 Prolog 风格推理的 LLM 模拟其执行。王等人 [ 95] 提出了 CORE 框架，该框架在迭代改进推理依据和答案的同时，确保模型的置信度与逻辑命题相一致。QUIRE [ 83] 通过在启动 CoT 推理前重新强调关键输入信息，增强了忠实性。

In addition, there are also many works trying to improve the faithfulness of the model through post-training [96, 88]. Gao et al. [96] constructed a dataset to train the model with three stages: faithful program generation, concise CoT conversion, and transferability filtering. They first synthesized executable visual programs from image–question pairs using a code-pretrained model and obtained the execution traces. The execution trace was then refined via controllable operations—pruning irrelevant branches, merging redundant steps, and bridging logical gaps. Finally, CoTs that prove effective in guiding end-to-end MLLMs were selected for knowledge distillation, which was conducted by both label and rationale loss, as in [239]. FRODO [88] first employed DPO to incentivize the generation of correct reasoning paths and discourage counterfactual or irrelevant steps. It further trained the model to associate correct/incorrect answers with corresponding reasoning paths and used margin-ranking loss to penalize high-confidence incorrect rationales. Viteri et al. [97] improved faithfulness via PPO [31], rewarding the model for generating correct rationales that lead to the answer even in the absence of the original prompt. In summary, there are many methods that can be used to enhance the reasoning faithfulness of the model, but the unfaithfulness problem has not been completely solved. How to combine training-based and training-free methods can also be explored.
此外，还有许多研究尝试通过后训练来提高模型的忠实度[96, 88]。高等人[96]构建了一个数据集，以三个阶段训练模型：忠实程序生成、简洁的 CoT 转换和可迁移性过滤。他们首先使用预训练的代码模型从图像-问题对中合成可执行的视觉程序，并获取执行轨迹。然后通过可控操作细化执行轨迹——剪枝无关分支、合并冗余步骤和填补逻辑空白。最后，选择在指导端到端 MLLM 时证明有效的 CoT 进行知识蒸馏，知识蒸馏通过标签和推理损失进行，如[239]所述。FRODO[88]首先采用 DPO 激励正确推理路径的生成，并抑制反事实或不相关的步骤。它进一步训练模型将正确/错误答案与相应的推理路径相关联，并使用边距排序损失惩罚高置信度的错误推理。Viteri 等人 [ 97] 通过 PPO [ 31] 提升了忠实度，奖励模型在即使没有原始提示的情况下也能生成正确推理并得出答案。总之，有许多方法可以用来增强模型的推理忠实度，但忠实度问题尚未完全解决。如何结合基于训练和无需训练的方法也是一个可以探索的方向。

3.2.4 Further Discussion of Faithfulness Definition
3.2.4 对忠实度定义的进一步讨论

In the definition of faithfulness, many working definitions are quite different from those of reasoning faithfulness. As a result, many researchers confuse them. For instance, a recent survey on LLM hallucinations defines faithfulness hallucination as “the divergence of generated content from user input or the lack of self-consistency within the generated content” [223]. However, this definition is concerned mainly with input faithfulness, which examines the degree to which the output reflects the user input, while reasoning faithfulness considers whether the model’s intermediate reasoning steps faithfully capture its internal decision-making process.
在忠实性的定义中，许多工作定义与推理忠实性定义差异很大。因此，许多研究人员将它们混淆。例如，最近关于 LLM 幻觉的调查将忠实性幻觉定义为“生成内容与用户输入的偏差或生成内容内部缺乏自洽性”[223]。然而，这个定义主要关注输入忠实性，它考察输出反映用户输入的程度，而推理忠实性则考虑模型的中级推理步骤是否忠实地捕捉了其内部决策过程。

Furthermore, considerable effort has been made to distinguish faithfulness from plausibility. Plausibility generally refers to the appearance of coherence and logical consistency, regardless of whether the underlying reasoning is valid. Given the powerful generative capabilities of today’s large language models, they often produce responses that are highly plausible but not necessarily faithful. Agarwal et al. [84] highlight this distinction, arguing that a response may appear convincing while still misrepresenting the model’s actual reasoning. Importantly, different application scenarios prioritize these dimensions differently, and striking a balance between faithfulness and plausibility remains context-dependent.
此外，人们已经付出了相当大的努力来区分忠实性与合理性。合理性通常指连贯性和逻辑一致性的表象，而不管其底层推理是否有效。鉴于当今大型语言模型的强大生成能力，它们经常生成高度合理但不一定忠实的回应。Agarwal 等人[84]强调了这一区别，认为一个回应可能看起来很有说服力，但仍然歪曲了模型的实际推理。重要的是，不同的应用场景对这些维度有不同的侧重，而在忠实性与合理性之间取得平衡仍然是一个挑战。context-dependent.

4 Safety

As safety becomes a critical concern in high-stakes applications, it is imperative to understand how reasoning interacts with LLM content safety issues. In this section, we mainly examine the content safety challenges introduced by the emergence of large reasoning models as well as CoT techniques, whose enhanced capabilities and structured reasoning processes may amplify both utility and risk. To be detailed, this section outlines key dimensions of safety related to reasoning capabilities, including vulnerability analysis, jailbreak attacks and defenses, safety alignment, and safety threats such as backdoor and prompt injection.
随着安全在高风险应用中成为关键问题，理解推理如何与 LLM 内容安全问题相互作用至关重要。在本节中，我们主要考察大型推理模型以及 CoT 技术带来的内容安全挑战，其增强的能力和结构化推理过程可能会同时放大效用和风险。具体而言，本节概述了与推理能力相关的安全关键维度，包括漏洞分析、越狱攻击与防御、安全对齐以及后门和提示注入等安全威胁。

4.1 Vulnerability Assessment
4.1 漏洞评估

Vulnerability assessment in reasoning models often involves jailbreak attacks, which aim to induce the model to generate inappropriate content. For large language models, many researchers developed related benchmarks [240, 241, 242, 243] to evaluate the jailbreak defense capability against previous attacks [244, 245, 246]. In terms of jailbreak assessment of large reasoning models, early works utilized jailbreak prompts from previous benchmarks mentioned above to evaluate the safety performance [3, 98, 99, 100, 103, 101, 1, 104]. Also, many researchers developed new benchmarks [105, 106, 107, 108] for a more targeted evaluation. Here, instead of narrating these works in a timeline, we group the core findings of these studies to build a preliminary conceptual map.
推理模型中的漏洞评估通常涉及越狱攻击，其目的是诱导模型生成不当内容。对于大型语言模型，许多研究人员开发了相关的基准测试[240, 241, 242, 243]，以评估模型对先前攻击[244, 245, 246]的越狱防御能力。在大型推理模型的越狱评估方面，早期研究利用上述基准测试中的越狱提示来评估安全性表现[3, 98, 99, 100, 103, 101, 1, 104]。此外，许多研究人员开发了新的基准测试[105, 106, 107, 108]，以进行更具针对性的评估。在此，我们不是按时间顺序叙述这些工作，而是将这些研究的核心发现进行分组，以构建一个初步的概念图。

Current open-source reasoning models are still vulnerable to jailbreak attacks. Evaluation results from many researchers [3, 98, 99, 103, 1, 104, 111] emphasized the safety vulnerability of current large reasoning models. SafeChain [1] evaluates concurrent reasoning models [15, 247, 248, 249, 250, 251] on StrongReject [241] and WildJailbreak [252], finding that all these modern large reasoning models should improve safety performance, for no model achieved a satisfactory result on both datasets. Zhou et al. [100] claimed that o3-mini is significantly safer than DeepSeek-R1 models on four datasets [242, 253]. Kassianik et al. [103] also mentioned that the attack success rate (ASR) of DeepSeek-R1 on Harmbench [240] is 100%, higher than o1-preview and other large language models [37, 13, 14], corresponding to conclusions from Marjanović et al. [111]. Ying et al.also mentioned that “both DeepSeek-V3 and DeepSeek-R1 models exhibit clear vulnerabilities when facing jailbreak attacks” after evaluating the safety performance on the CNSafe dataset [3]. Similarly, Krishna et al. [104] in their evaluation highlighted the category-wise and model-wise vulnerabilities when faced with various jailbreak attacks. Additionally, Fan et al. [106] discovered evaluation faking, where reasoning models may probably understand they are being evaluated and therefore alter their response to be safer. Zheng et al. [107] proposed BSAbench, which disclosed the safety vulnerability with more challenging queries. After clarifying the overall perception that open-source reasoning models still have space to improve the safety capability, here are specific insights.
当前开源的推理模型仍然容易受到越狱攻击。许多研究人员[3, 98, 99, 103, 1, 104, 111]的评估结果强调了当前大型推理模型的安全漏洞。SafeChain[1]在 StrongReject[241]和 WildJailbreak[252]上评估了并发推理模型[15, 247, 248, 249, 250, 251]，发现这些现代大型推理模型都应该提高安全性，因为没有任何模型在这两个数据集上都取得了令人满意的结果。Zhou 等人[100]声称，在四个数据集[242, 253]上，o3-mini 比 DeepSeek-R1 模型安全得多。Kassianik 等人[103]也提到，DeepSeek-R1 在 Harmbench[240]上的攻击成功率（ASR）为 100%，高于 o1-preview 和其他大型语言模型[37, 13, 14]，与 Marjanović等人[111]的结论一致。Ying 等人也在 CNSafe 数据集[3]上评估了安全性后提到，“DeepSeek-V3 和 DeepSeek-R1 模型在面对越狱攻击时都表现出明显的漏洞”。类似地，Krishna 等人 [ 104] 他们的评估突出了在面对各种越狱攻击时，按类别和按模型的漏洞。此外，Fan 等人[ 106]发现了评估造假现象，即推理模型可能会意识到自己正在被评估，因此改变其响应以变得更安全。Zheng 等人[ 107]提出了 BSAbench，它通过更具挑战性的查询揭露了安全性漏洞。在阐明开源推理模型在安全性能力方面仍有提升空间的整体认知后，以下是具体见解。

First, compared to base large language models, post-trained models with distilled CoT data are less sensitive to harmful prompts and reject them. SafeChain [1] proposed that learning long CoT does not necessarily improve model safety when comparing DeepSeek-R1-70B with Llama-3.3-Instruct-70B. A similar conclusion is also made by Zhou et al. [100]. Additionally, Zhang et al. [98] evaluated the DeepSeek distilled model series on CHisafetybench [254], and concluded that in terms of the risk content identification task and the “refusal to answer task”, a few reasoning models experienced a decrease in rejection rate and responsibility rate, indicating higher compliance behavior on harmful requests. Zhao et al. [110] also mentioned that acquiring deliberate reasoning capabilities would sacrifice model general performance.
首先，与基础大型语言模型相比，经过蒸馏的 CoT 数据训练的模型对有害提示不那么敏感，并会拒绝它们。SafeChain [ 1] 提出在比较 DeepSeek-R1-70B 与Llama-3.3-Instruct-70B.时，学习长 CoT 并不一定能提高模型安全性。周等人 [ 100] 也得出了类似的结论。此外，张等人 [ 98] 在 CHisafetybench [ 254] 上评估了 DeepSeek 蒸馏模型系列，并得出结论，在风险内容识别任务和“拒绝回答任务”方面，少数推理模型拒绝率和责任率有所下降，表明它们在有害请求上表现出更高的合规行为。赵等人 [ 110] 也提到，获得有意推理能力会牺牲模型的泛化性能。

Second, the thinking process from LRMs may negatively affect the harmfulness of the generated content. Jiang et al. [1] designed different thinking templates to control the reasoning process, and conducted experiments to compare the harmfulness of answers given different lengths of reasoning tokens. It turns out that compared to the default content generation, forcing the model to skip reasoning or shorten reasoning could boost the harmlessness of the answers at least on StrongReject [241] and WildJailbreak [252]. Zhou et al. [100] and Zhao et al. [110] also reinforce such an idea: they compared the answers of two pairs of reasoning models with the base models on harmful prompts, demonstrating that LRMs tend to provide more detailed and helpful answers, making the output more harmful. Furthermore, when directly evaluating the harmfulness of thinking content and final answers of DeepSeek-R1-Distill-70B on AirBench [242] and WildGuard [243], the safety rate of thinking content is consistently less than that of final answers. Ying et al. [3] also supported the vulnerability of reasoning content, indicating that the exposed reasoning chains may increase safety risks.
其次，大型语言模型的思考过程可能对生成内容的危害性产生负面影响。Jiang 等人[1]设计了不同的思考模板来控制推理过程，并通过实验比较了不同推理 token 长度下答案的危害性。结果表明，与默认内容生成相比，强制模型跳过推理或缩短推理可以至少在 StrongReject[241]和 WildJailbreak[252]上提升答案的无害性。Zhou 等人[100]和 Zhao 等人[110]也强化了这一观点：他们在有害提示下比较了两组推理模型与基础模型的答案，证明大型语言模型倾向于提供更详细和有帮助的答案，使输出更具危害性。此外，在 AirBench[242]和 WildGuard[243]上直接评估DeepSeek-R1-Distill-70B的思考内容与最终答案的危害性时，思考内容的危险率始终低于最终答案。Ying 等人[3]也支持了推理内容的脆弱性，表明暴露的推理链可能会增加安全风险。

Third, Pairwise safety ranks between models depend on datasets. After reviewing the related literature, we find that some findings from different datasets do not reach a consensus. For example, evaluations on Airbench [242] claimed that DeepSeek-R1 is safer than DeepSeek-V3 [100], while under CNSafe, DeepSeek-V3 exceeds DeepSeek-R1 with an average ASR margin of 21.7% across all risk categories [3]. However, when red-teaming with jailbreak templates, experiments on WildGuard Jailbreak [100] and CNSafe_RT [3] conversely showed that DeepSeek-R1 could identify the risk in jailbreak prompts and provide a safe thinking chain. Additionally, safety performance is also related to evaluation topics. For the DeepSeek-distilled model series, the most notable declines in safety performance are observed in areas such as health discrimination, sexism, regional discrimination, and occupational discrimination [98]. In contrast, DeepSeek-R1 exhibits pronounced vulnerabilities in cybersecurity-related topics [100]. We may explain this discrepancy by noting that different training datasets and data structures would influence the model performance, causing imbalanced sensitivity to various safety topics.
第三，模型之间的成对安全排名取决于数据集。在回顾相关文献后，我们发现来自不同数据集的一些研究结果并未达成共识。例如，对 Airbench [ 242] 的评估声称 DeepSeek-R1 比 DeepSeek-V3 [ 100] 更安全，而在 CNSafe 下，DeepSeek-V3 在所有风险类别中平均 ASR 优势为 21.7% [ 3]。然而，在 WildGuard Jailbreak [ 100] 和 CNSafe_RT [ 3] 的红队测试中，使用越狱模板时，实验结果反而显示 DeepSeek-R1 能够识别越狱提示中的风险并提供安全的推理链。此外，安全性能也与评估主题相关。对于 DeepSeek 蒸馏模型系列，在健康歧视、性别歧视、地域歧视和职业歧视等领域观察到最显著的安全性能下降 [ 98]。相比之下，DeepSeek-R1 在网络安全相关主题上表现出明显的脆弱性 [ 100]。我们可以通过指出不同的训练数据集和数据结构会影响模型性能，从而导致对各种安全话题的敏感性不平衡来解释这种差异。

Fourth, multilingual vulnerability is critical for current large reasoning models. Multilingual vulnerability is also a representation of “mismatched generalization” [255], which means that models may possess different safety capabilities in different language environments. Romero-Arjona et al. [99] identified the safety vulnerability in Spanish and Basque. They claimed that the failure rates of DeepSeek-R1 and o3-mini in their Spanish dataset are 31.7% and 29.5%. Zhang et al. [98] made a detailed evaluation on the Chinese dataset CHisafetybench [254] and identified a clear safety decline after distillation. Ying et al. [3] also found that for both DeepSeek-V3 and DeepSeek-R1, the ASR in the English environment is larger than that in Chinese, disclosing the safety capability imbalance about language.
第四，多语言漏洞是当前大型推理模型的关键问题。多语言漏洞也是“不匹配泛化”[ 255]的一种表现，这意味着模型在不同语言环境下可能具有不同的安全能力。Romero-Arjona 等人[ 99]在西班牙语和巴斯克语中发现了安全漏洞。他们声称，DeepSeek-R1 和 o3-mini 在其西班牙语数据集中的失败率分别为 31.7%和 29.5%。Zhang 等人[ 98]对中文数据集 CHisafetybench [ 254]进行了详细评估，并发现蒸馏后安全性能明显下降。Ying 等人[ 3]也发现，对于 DeepSeek-V3 和 DeepSeek-R1，英语环境下的 ASR（自动语音识别）大于中文环境下的 ASR，揭示了语言安全能力的失衡。

Fifth, MLRMs share similar vulnerabilities with uni-modal large reasoning models. With the development of MLRMs [256, 248, 257, 258], researchers also found similar vulnerabilities with early safety assessments. Fang et al. [109] identified that model safety performance varies in terms of different topics, and defined such a phenomenon as “safety blind spots”, which resembles the third point mentioned above. Lou et al. [102] mentioned the higher risk of the thinking process than the final answers of MLRMs and the vulnerability against jailbreak attacks compared to the base MLLMs, which are consistent with the first two insights. In addition, it is also observed that converting images into captions could recover the safety capability to some extent [102], which again demonstrated the imbalanced domain vulnerability in MLLMs [259, 260]. Experiments from both literature [109, 102] also pointed out that the emergent self-correction in the thinking process helps avoid harmful content generation, even if there were still cases where unsafe reasoning was generated, followed by inappropriate answers.
第五，多模态大语言推理模型（MLRMs）与单模态大推理模型具有相似的漏洞。随着 MLRMs 的发展[256, 248, 257, 258]，研究人员也发现了与早期安全评估相似的漏洞。Fang 等人[109]发现模型安全性能在不同主题上存在差异，并将这种现象定义为“安全盲点”，这与上述第三点相似。Lou 等人[102]提到 MLRMs 的思考过程比最终答案具有更高的风险，并且与基础 MLLMs 相比，它们更容易受到越狱攻击，这与前两个见解一致。此外，还观察到将图像转换为文本可以在一定程度上恢复安全能力[102]，这再次证明了 MLLMs 中领域漏洞的不平衡性[259, 260]。来自文献[109, 102]的实验也指出，思考过程中的涌现式自我纠正有助于避免生成有害内容，即使仍然存在生成不安全推理并随后给出不当答案的情况。

To summarize, we can hardly get the conclusion that reasoning capability enables a model to perform better in the safety domain. Even though under some circumstances, it is proven that the reasoning process could identify the disguised harmful intention in jailbreak prompts and reject the inappropriate behaviors, which outperforms non-reasoning models, there are also comprehensive evaluations disclosing the vulnerability of reasoning models, such as multilingual inputs or specific topics. Except for o1 or o3-mini [20], which are safer than other open-source large reasoning models with a slightly obvious margin, there is still space to boost safety performance via inference-time scaling, just as in the general performance domain.
总而言之，我们很难得出推理能力使模型在安全领域表现更好的结论。尽管在某些情况下，已证明推理过程能够识别出越狱提示中伪装的恶意意图并拒绝不适当的行为，这优于非推理模型，但也有综合评估揭示了推理模型的漏洞，例如多语言输入或特定主题。除了 o1 或 o3-mini [ 20]（它们比其他开源大型推理模型稍微安全一些），通过推理时扩展来提升安全性能仍有空间，就像在一般性能领域一样。

4.2 Jailbreak 4.2 越狱

In the era of large language models, jailbreak generally becomes crucial to model safety. In this script, we mainly focus on jailbreak topics related to CoT or current large reasoning models represented by OpenAI o1 [20], DeepSeek-R1 [15], etc. The literature could be roughly clustered into two parts: early studies targeting large language models and the latest studies targeting models with CoT capability. Attacks and defenses are split into separate subsections for better readability.
在大语言模型时代，越狱通常成为模型安全的关键。在本脚本中，我们主要关注与思维链（CoT）或以 OpenAI o1 [20]、DeepSeek-R1 [15]等为代表的当前大型推理模型相关的越狱话题。文献大致可分为两部分：针对大型语言模型早期的研究和针对具有思维链能力的最新模型的研究。攻击与防御分别分为独立的子章节，以提高可读性。

4.2.1 Jailbreaking with Reasoning Techniques
4.2.1 使用推理技术进行越狱

CoT techniques enable large language models to perform better on various general tasks [114, 17, 16, 116]. Therefore, recent literature has also proposed methods to generate more deceptive jailbreak prompts [112, 113, 116] or create more detailed and harmful content with reasoning techniques [114, 115] while overlooking their safety issues. Specifically, Sabbaghi et al. [112] introduced a feedback model as well as a refiner model to iteratively modify the jailbreak prompt with CoT paths given the calculated loss score, for models with CoT could better identify the imperfection of each round of jailbreak prompts, provide more targeted modifications, and then enhance the ASR. This method followed the logic of previous black-box jailbreak methods [245, 246], which evaluated and modified their jailbreak prompts according to the interactions with the target models. Ying et al. [114] proposed a multi-turn method to transform harmful prompts into several superficially benign questions. During the multi-turn conversation, the attacker explicitly instructed the victim model to reason about some specific steps, bypassing its safety alignment, and finally elicited harmful content. Similarly, Chang et al. [115] wrapped the sensitive instruction into a narrative task, designing CoT-style prompts to instruct victim models to generate details and finish the story while bypassing internal safety barriers. Handa et al. [116] proposed to jailbreak models with complex ciphers. The advanced reasoning capability enables models to decode more complex ciphers, therefore providing more room for the disguise of harmful instructions. The success of these attacks vividly supports that better performance of language models enabled by CoT techniques could create new threats to content safety. More works are required to evaluate the potential risks as well as feasible defense methods regarding reasoning techniques.
CoT 技术使大型语言模型在多种一般任务上表现更好[114, 17, 16, 116]。因此，近期文献也提出了生成更具欺骗性的越狱提示的方法[112, 113, 116]，或利用推理技术创建更详细和有害的内容[114, 115]，而忽略了其安全问题。具体来说，Sabbaghi 等人[112]引入了一个反馈模型和一个精炼模型，根据计算出的损失分数，使用 CoT 路径迭代地修改越狱提示，对于使用 CoT 的模型来说，可以更好地识别每一轮越狱提示的不完善之处，提供更有针对性的修改，然后提升 ASR。这种方法遵循了先前黑盒越狱方法的逻辑[245, 246]，这些方法根据与目标模型的交互来评估和修改其越狱提示。Ying 等人[114]提出了一种多轮方法，将有害提示转化为几个表面上无害的问题。在多轮对话中，攻击者明确指示受害者模型对某些特定步骤进行推理，绕过其安全对齐，最终诱发出有害内容。类似地，Chang 等人[ 115]将敏感指令封装成一个叙事任务，设计了 CoT 风格的提示，指导受害模型生成细节并完成故事，同时绕过内部安全屏障。Handa 等人[ 116]提出了使用复杂密码破解模型的方法。高级推理能力使模型能够解码更复杂的密码，因此为有害指令的伪装提供了更多空间。这些攻击的成功生动地表明，由 CoT 技术带来的语言模型性能提升可能为内容安全创造新的威胁。需要更多研究来评估推理技术的潜在风险以及可行的防御方法。

4.2.2 Jailbreaking Reasoning Models
4.2.2 破解推理模型

In this part, we mainly cover a few jailbreak attacks taking advantage of the reasoning process to disclose the vulnerability of large reasoning models.
在这一部分，我们主要涵盖了几种利用推理过程来揭露大型推理模型漏洞的越狱攻击。

Kuo et al. [117] proposed H-CoT, containing well-curated reasoning content in the prompts to obfuscate the models. Here we borrow an example from the original paper as an illustration.
Kuo 等人[117]提出了 H-CoT，在提示中包含精心策划的推理内容来混淆模型。这里我们借鉴原始论文中的一个例子作为说明。

Figure 3: An example of H-CoT jailbreak prompt, which is from “DukeCEICenter/Malicious_Educator_hcot_o1” dataset [117].
图 3：H-CoT 越狱提示的示例，来自“DukeCEICenter/Malicious_Educator_hcot_o1”数据集[117]。

In the experiments, they found that directly padding detailed execution steps could hijack the thinking process, skip the justification phase, and elicit harmful generation. After that, Yao et al. proposed “Mousetrap” [118], splitting the harmful prompts into several steps for models to reason. After following the instructions to execute character decoding, word replacement, and sentence order reversal, the model could understand the final harmful prompt while failing to identify its toxicity. Such an attack resembles the classical “base-64 encoding” jailbreak [261, 255], sharing the logic of mismatched generalization [255]. Liang et al. [119] proposed AutoRAN, claiming it as the first automated jailbreak attack specifically targeting reasoning models, enabled by a self-designed, predefined attack workflow. Nguyen et al. [120] came up with “SEAL” to circumvent LRM internal defenses, selecting ciphering methods from an encryption algorithm set to encode harmful instructions. Lu et al. [2] proposed FicDetail to jailbreak reasoning models, creating a fiction story with multi-turn queries to enrich details with harmful contents. Lian et al. [121] exploited the intrinsic ethical vulnerability from distribution shift and in LLMs, designing an attack with semantic coherence inducement to jailbreak DeepSeek-R1 successfully. Ma et al. [124] proposed HauntAttack, which wraps harmful instructions into normal, realistic scenarios to deceive reasoning models. For MLRMs, Sima et al. [123] designed VisCRA, exploiting reasoning capabilities to force models to first infer masked objects in images and then create detailed answers for harmful instructions. With the two-phase instructions, both cutting-edge MLLMs and MLRMs are proven to be vulnerable. In the tool learning domain, Liu et al. [122] developed Tool-CoT attack, in which the agent is prompted to call external functions for more harmful information. Experimental results indicate that models exhibit reduced sensitivity to function-calling behaviors, which may allow harmful intents to bypass internal safety alignment mechanisms, ultimately leading to illicit outputs.
在实验中，他们发现直接填充详细的执行步骤可以劫持思考过程，跳过论证阶段，并诱发出有害的生成内容。之后，Yao 等人提出了“捕鼠夹”[ 118]，将有害提示拆分成多个步骤供模型推理。在按照指令执行字符解码、单词替换和句子顺序反转后，模型能够理解最终的有害提示，但无法识别其毒性。这种攻击类似于经典的“base-64 编码”越狱[ 261, 255]，共享了不匹配泛化的逻辑[ 255]。Liang 等人[ 119]提出了 AutoRAN，声称这是首个专门针对推理模型的自动化越狱攻击，通过自行设计、预定义的攻击工作流程实现。Nguyen 等人[ 120]提出了“SEAL”，用于绕过 LRM 内部防御，从加密算法集中选择加密方法来编码有害指令。Lu 等人[ 2]提出了 FicDetail 来越狱推理模型，通过创建包含多轮查询的虚构故事来用有害内容丰富细节。Lian 等人 [ 121] 利用分布偏移和 LLMs 的内在伦理漏洞，设计了一种带有语义连贯性诱导的攻击方法，成功绕过了 DeepSeek-R1。Ma 等人[ 124]提出了 HauntAttack，将有害指令封装在正常、真实的场景中，以欺骗推理模型。对于 MLRMs，Sima 等人[ 123]设计了 VisCRA，利用推理能力迫使模型首先推断图像中的掩码对象，然后为有害指令创建详细答案。通过两阶段指令，证明最先进的 MLLMs 和 MLRMs 都存在漏洞。在工具学习领域，Liu 等人[ 122]开发了 Tool-CoT 攻击，其中代理被提示调用外部函数以获取更多有害信息。实验结果表明，模型对函数调用行为表现出降低的敏感性，这可能允许有害意图绕过内部安全对齐机制，最终导致非法输出。

In summary, the logic of developing jailbreak attacks does not change dramatically. Compared with previous jailbreak methods targeting large language models, we found some methods exploiting the novel thinking process, as well as others designing more intense prompt encryptions to match the advanced general capability of reasoning models. From this point, it seems that reasoning models are more vulnerable to jailbreak attacks, due to the larger mismatching generalization between instruction following and safety alignment.
总之，开发越狱攻击的逻辑并没有发生显著变化。与之前针对大型语言模型的越狱方法相比，我们发现了一些利用新型思维过程的方法，以及一些设计更复杂的提示加密以匹配推理模型的先进通用能力的方法。从这个角度来看，由于指令遵循与安全对齐之间存在更大的泛化不匹配，推理模型似乎更容易受到越狱攻击。

4.2.3 Jailbreak Defense with Reasoning Techniques
4.2.3 基于推理技术的越狱防御

Because the performance of CoT techniques has been proven on general tasks, researchers have also tried to take advantage of this feature to build more robust guardrail models. GuardReasoner [125] curated 127k data samples with 460k reasoning steps in total to finetune a large language model, enabling the guardrail models to judge the harmfulness of prompts and answers. Similar to LLM alignment with CoT data in Sec. 4.3.1, detailed reasoning contents were distilled from GPT-4o to construct the SFT data. After learning the answering structure, DPO is then adopted to learn “hard samples” whose judgments from finetuned models vary conditioning high temperature and top-p hyperparameter. X-Guard [126] noticed the judgment inaccuracy on low-resource languages and code-switching attacks, creating a safety dataset spanning 132 languages and updating the model weight with SFT followed by GRPO. Also noticing the judgment inaccuracy on multi-lingual inputs, MrGuard [127] elaborated curriculum learning with reasoning to improve the robustness towards low-resource languages. Similarly, RSafe [128] utilized GRPO to train a robust and generalizable guardrail model, successfully adapting to user-specified safety policies. Sreedhar et al. [129] conducted a study on reasoning-augmented guardrail models, demonstrating the benefits of reasoning in terms of detection accuracy, efficiency, generalization, etc. Kang et al. [130] proposed R²-Guard to detect unsafe contents with reasoning enabled by probabilistic graphical models (PGMs). For vision-language models (VLM), GuardReasoner-VL [133] shared a similar logic with the previous method [125], extending the model to the vision domain. ShieldVLM [132] simply used SFT with high-quality multimodal reasoning data to enhance the detection capability, achieving the harmfulness of image-text input pairs without model answers. In terms of agent safety, Xiang et al. [134] developed GuardAgent to monitor agent actions. Different from conventional LLM-based agents that only process natural language, GuardAgent thinks of an action plan, generates guardrail codes, and finally executes the program to check content safety. Chen et al. [135] also proposed ShieldAgent to tackle this problem, in which they encoded safety constraints in knowledge graphs. Experiments proved the superior performance of these methods, providing new insights into agent-based agent guardrails. Aside from the guardrail models mentioned above, reward models could also contribute to content identification as well as model alignment [131, 136]. Pan et al. [137] proposed U-CoT+ to detect harmful memes with zero-shot CoT prompts. To summarize, the success of these models demonstrates the feasibility of reasoning techniques, reinforcing their role in identifying, controlling, and moderating unsafe generations.
由于 CoT 技术在一般任务上的性能已被证明，研究人员也尝试利用这一特性来构建更鲁棒的护栏模型。GuardReasoner [ 125] 筛选了 127k 个数据样本，总共包含 460k 个推理步骤，用于微调大型语言模型，使护栏模型能够判断提示和答案的危害性。类似于第 4.3.1 节中与 CoT 数据对齐的 LLM，从 GPT-4o 中提取详细的推理内容来构建 SFT 数据。在学会回答结构后，采用 DPO 来学习“难样本”，这些样本在微调模型的判断上因高温和 top-p 超参数的不同而有所差异。X-Guard [ 126] 注意到低资源语言和代码转换攻击上的判断不准确，创建了一个涵盖 132 种语言的安全生产集，并使用 SFT 更新模型权重，然后进行 GRPO。同样注意到多语言输入上的判断不准确，MrGuard [ 127] 详细阐述了推理的分层学习，以提高对低资源语言的鲁棒性。类似地，RSafe [ 128] 利用 GRPO 训练了一个鲁棒且可泛化的护栏模型，成功适应了用户指定的安全策略。Sreedhar 等人 [ 129] 对推理增强型护栏模型进行了研究，展示了推理在检测精度、效率、泛化能力等方面的优势。Kang 等人 [ 130] 提出了 R ² -Guard，通过概率图模型（PGMs）启用的推理来检测不安全内容。对于视觉语言模型（VLM），GuardReasoner-VL [ 133] 与前述方法 [ 125] 逻辑相似，将模型扩展到视觉领域。ShieldVLM [ 132] 仅使用高质量的多模态推理数据对 SFT 进行应用，以增强检测能力，实现了对图像-文本输入对有害性的检测，无需模型答案。在智能体安全方面，Xiang 等人 [ 134] 开发了 GuardAgent 来监控智能体行为。不同于仅处理自然语言的传统 LLM 智能体，GuardAgent 会制定行动计划，生成护栏代码，并最终执行程序来检查内容安全性。Chen 等人 [ 135] 还提出了 ShieldAgent 来解决这个问题，其中他们将安全约束编码在知识图谱中。实验证明了这些方法的优越性能，为基于代理的代理护栏提供了新的见解。除了上述提到的护栏模型之外，奖励模型也可以用于内容识别以及模型对齐[ 131, 136]。Pan 等人[ 137]提出了 U-CoT+，使用零样本 CoT 提示来检测有害的梗图。总而言之，这些模型的成功证明了推理技术的可行性，强化了它们在识别、控制和调节不安全生成内容中的作用。

4.2.4 Jailbreak Defense for Reasoning Models
4.2.4 推理模型的越狱防御

Jailbreak defense could be facilitated in different stages. Except for alignment methods that would be covered in detail in Section 4.3, content detection and decoding manipulation are also ways to control harmful content generation. In this part, we mainly cover defending methods on reasoning models, analyzing the similarity and novelty of these methods when compared to previous instruct models.
越狱防御可以在不同阶段进行。除了将在第 4.3 节详细介绍的校准方法外，内容检测和解码操作也是控制有害内容生成的方式。在本部分，我们主要涵盖推理模型的防御方法，并分析这些方法与先前指令模型相比的相似性和新颖性。

Input-phase defense. At first, Jailbreak defense in LLM followed the logic of prompt engineering, designing a detailed prompt before or after user prompts as an extra instruction to depress inappropriate behaviors [262, 263, 264, 265]. Sharing some degrees of similarity, Jiang et al. [1] mentioned that Zerothink mode could improve the defense capability, and Wu et al. [138] demonstrated that adding safety-related instructions in the reasoning trace could outperform manipulations in user prompts [262, 263], with an explanation that attention of reasoning process focuses more on internal tokens instead of input prompts. Yamaguchi et al. [139] also designed experiments on DeepSeek-R1-Distill-Llama, and found that whether the model rejects or complies with the instruction is predictable from intermediate activations of CoT tokens. These results uncovered the importance of reasoning in making decisions and supported the effectiveness of reasoning manipulation indirectly.
输入阶段防御。最初，LLM 中的越狱防御遵循提示工程的逻辑，在用户提示之前或之后设计一个详细的提示作为额外指令来抑制不适当行为[262, 263, 264, 265]。江等人[1]提到 Zerothink 模式可以提高防御能力，具有一定程度上的相似性，而吴等人[138]则证明在推理轨迹中添加安全相关指令可以优于用户提示中的操控[262, 263]，并解释称推理过程中的注意力更多地集中在内部 token 而不是输入提示上。山口等人[139]也在DeepSeek-R1-Distill-Llama上设计了实验，发现模型是否拒绝或遵从指令可以从 CoT token 的中期激活中预测。这些结果揭示了推理在决策中的重要性，并间接支持了推理操控的有效性。

Decoding-phase defense. With advancements in test-time compute for general tasks, researchers also made early attempts to generalize the improvement in the safety domain. Wang et al. [12] revealed that applying Best-of-N (BoN) strategies could enhance the model safety, suggesting the existence of latent safety knowledge. Zaremba et al. [140] found that the robustness of the OpenAI o1 series improved when increasing the test-time compute under a few settings. Saffron-1 [141] focused on the inefficiency of inference-scaling methods in safety contexts, proposing a novel inference-time scaling paradigm for efficient and safe decoding control. Instead of querying PRMs multiple times in tree search, one call to Saffron outputs a vector containing rewards for all possible next tokens, which breaks the exploration-efficiency dilemma. In addition, previous methods tried to manipulate the output logits of each token for safer generations [266, 267, 268], which may also provide a feasible way for safety generation.
解码阶段防御。随着通用任务测试时计算的进步，研究人员也早早尝试将安全领域的改进进行泛化。Wang 等人[12]揭示，应用 Best-of-N（BoN）策略可以增强模型安全性，表明存在潜在的安全知识。Zaremba 等人[140]发现，在少数设置下增加测试时计算可以提高 OpenAI o1 系列的鲁棒性。Saffron-1[141]关注了安全环境下推理扩展方法的不效率，提出了一个用于高效和安全解码控制的推理时扩展范式。在树搜索中不再多次查询 PRMs，Saffron 的一次调用就输出一个包含所有可能下一个 token 奖励的向量，打破了探索效率的困境。此外，先前的方法尝试操纵每个 token 的输出 logits 以生成更安全的文本[266, 267, 268]，这也可能为安全生成提供了一种可行方法。

Post-hoc defense. Guardrail models, or LLMs-as-a-judge, serve as an external safety guard for language model content generation [269]. To identify the ASR of jailbreak methods, except for simple string-matching methods, LLM could be elaborated for harmful data detection, including prompting cutting-edge general models (such as GPT series [13]) with pre-defined safety principles, or fine-tuning with well-curated safety data (Llama-Guard series [270, 271]). Considering the safety risk in reasoning traces [1, 100, 3], ReasoningShield [151] curated a dataset with 8k prompt-CoT pairs and finetuned Llama-3.2 [272] to identify harmfulness in the reasoning traces as well as the final answers. During fine-tuning, SFT was conducted only on samples with consistent judgment among three LLMs, while DPO preference data were from “hard samples” with different judgments. In terms of LLM-based agents that generate thoughts before subsequent actions, Jiang et al. [150] thought highly of the timely intervention of potentially harmful thoughts, trained the “Thought-Aligner” to generate safer and more cautious reasoning processes for replacement. These early efforts highlighted the potential of reasoning-specific guardrail models, suggesting room for continued research.
后置防御。防护模型，或作为裁判的 LLM，为语言模型内容生成提供外部安全保护[269]。为了识别越狱方法的 ASR，除了简单的字符串匹配方法外，LLM 可以被用于有害数据检测，包括用预定义的安全原则提示前沿通用模型（如 GPT 系列[13]），或使用精心策划的安全数据进行微调（Llama-Guard 系列[270, 271]）。考虑到推理轨迹中的安全风险[1, 100, 3]，ReasoningShield[151]策划了一个包含 8k 提示-CoT 对的数据库，并微调了 Llama-3.2[272]，以识别推理轨迹以及最终答案中的有害性。在微调过程中，仅对三个 LLM 判断一致的样本进行 SFT，而 DPO 偏好数据来自具有不同判断的“困难样本”。对于在后续行动之前生成思想的基于 LLM 的代理，Jiang 等人[150]高度评价了对潜在有害思想的及时干预，训练了“思想对齐器”以生成更安全、更谨慎的推理过程进行替换。这些早期工作突出了推理特定护栏模型的潜力，表明了持续研究的空间。

4.3 Alignment 4.3 对齐

Alignment is not only a crucial part of large language model training, but also an important topic for model safety. In the training phase, alignment is originally proposed to align model reaction with human expectation [273]. During last three years, a lot of methods, including reinforcement learning from human feedback (RLHF) and its variants, are proposed to enhance the conversation performance of instruct models [274, 275, 276, 30, 31]. Considering safety alignment, most methods collect a fine-tuning dataset including prompt-rejection pairs compassing various sensitive topics to update model weights [277, 278, 260, 279]. Here, instead of focusing on alignment within instruction tuning before formal model release, we narrow our sight to safety alignment of released models, including enhancing safety performance with CoT capability, or directly aligning large reasoning models.
对齐不仅是大型语言模型训练的关键部分，也是模型安全的重要议题。在训练阶段，对齐最初被提出是为了使模型反应与人类期望相一致[273]。在过去的三年里，提出了许多方法，包括人类反馈强化学习（RLHF）及其变体，以提升指令模型的对话性能[274, 275, 276, 30, 31]。考虑到安全对齐，大多数方法收集一个微调数据集，包括涵盖各种敏感话题的提示拒绝对，以更新模型权重[277, 278, 260, 279]。在这里，我们不再关注正式模型发布前的指令微调阶段的对齐，而是将目光聚焦于已发布模型的安全对齐，包括通过 CoT 能力提升安全性能，或直接对齐大型推理模型。

4.3.1 Aligning LLM Using Reasoning Techniques
4.3.1 使用推理技术对齐 LLM

Noticing the performance of CoT behaviors, researchers tend to facilitate safety alignment with CoT datasets [143, 144, 147, 148, 149, 145]. To be detailed, Liu et al. [142] proposed to train multiple low-rank adaptation (LoRA) [280] variants as Mixture-of-Experts (MoE) to explicitly analyze question intentions, answer guidances, and the final response. Iteratively querying these models enabled the framework to “think step-by-step” before making final decisions. Zhang et al. [143] added a reset token to elicit self-corrections after a partial unsafe generation. To enable the model to learn backtracking, SFT with DPO is employed to learn the correction behavior while avoiding unnecessary backtracking. Yang et al. [144] proposed Safety Chain-of-Thought (SCoT) to provide detailed analyses of potential risks before answering, claiming that SFT on mixed CoT datasets could enhance the defense capability against various attacks [244, 281]. Similarly, Zhang et al. [282] proposed to utilize data from Monte-Carlo Tree Search (MCTS) to improve the safety alignment. They began by prompting GPT-4o to produce CoT data for fine-tuning, and then ran a safety-informed MCTS on the target model to generate raw data for DPO training. R2D [145] generated a pivot token including “[SAFE]”, “[UNSAFE]”, and “[RETHINK]” after each thinking step, and added an extra contrastive loss on the pivot tokens in SFT. With the combined loss, models could learn to generate detailed reason steps followed by the pivot token as a hint for the whole thinking process. RATIONAL [146] also identified the imperfection of direct refusal to harmful queries, curating a CoT dataset consisting of both adversarial data and sensitive benign data by prompting Llama-3-8B-Instruct for following supervised fine-tuning. ERPO [147] also adopted SFT followed by DPO, while adding extra “length-controlled iterative preference optimization strategy” to shorten generation length in the iterative preference optimization algorithm. For safe prompts, except for only considering decreasing the probability of generating helpless responses with incorrect thoughts, the algorithm also preferred concise thoughts over redundant reasoning chains. SaRO [148] picked prompts from SALAD-Bench [283] and OpenOrca [284] with reasoning generation from GPT-4o to get the CoT data for supervised fine-tuning, enabling models to learn the thinking-answer template. Wang et al. [12] underscored the generalization weaknesses of refusal training, introducing guidelines for better safety reasoning. Kim et al. [149] distilled data from reasoning models and adopted SFT with GRPO for adaptive defense.
注意到思维链行为的性能，研究人员倾向于通过思维链数据集促进安全对齐[ 143, 144, 147, 148, 149, 145]。具体来说，刘等人[ 142]提出训练多个低秩适配（LoRA）[ 280]变体作为专家混合（MoE），以明确分析问题意图、答案指导以及最终响应。迭代查询这些模型使框架能够在做出最终决定前“逐步思考”。张等人[ 143]添加了一个重置标记，在部分不安全生成后进行自我纠正。为了使模型能够学习回溯，采用带 DPO 的 SFT 来学习纠正行为，同时避免不必要的回溯。杨等人[ 144]提出了安全思维链（SCoT），在回答前提供潜在风险的详细分析，声称在混合思维链数据集上的 SFT 可以增强对各种攻击的防御能力[ 244, 281]。类似地，张等人[ 282]提出利用蒙特卡洛树搜索（MCTS）的数据来提高安全对齐。他们首先提示 GPT-4o 生成用于微调的 CoT 数据，然后对目标模型运行基于安全信息的 MCTS，以生成用于 DPO 训练的原始数据。R2D [ 145] 在每一步思考后生成包含“[SAFE]”、“[UNSAFE]”和“[RETHINK]”的枢轴标记，并在 SFT 中对该枢轴标记添加额外的对比损失。通过组合损失，模型能够学习生成详细的原因步骤，并以枢轴标记作为整个思考过程的提示。RATIONAL [ 146] 也识别了直接拒绝有害查询的不完善性，通过提示Llama-3-8B-Instruct生成包含对抗数据和敏感良性数据的 CoT 数据集，用于后续的监督微调。ERPO [ 147] 也采用了 SFT 后接 DPO，同时添加了额外的“长度控制迭代偏好优化策略”，以缩短迭代偏好优化算法中的生成长度。对于安全提示，除了考虑降低生成错误思考的无助响应的概率外，算法也更倾向于简洁的思考而不是冗余的推理链。 SaRO [ 148] 从 SALAD-Bench [ 283] 和 OpenOrca [ 284] 中选取提示，并使用 GPT-4o 生成推理数据，以获取用于监督微调的 CoT 数据，使模型能够学习思考-答案模板。Wang 等人 [ 12] 强调了拒绝训练的泛化弱点，并引入了更好的安全推理指南。Kim 等人 [ 149] 从推理模型中提取数据，并采用带有 GRPO 的 SFT 进行自适应防御。

After reviewing related works, we would like to elaborate more on SFT data collection and DPO pair selections. Mainstream SFT methods utilize off-the-shelf datasets, originally created for safety alignment or benchmarking harmfulness, to collect prompts and safe answers [143, 282, 145, 146, 147, 148, 149]. These datasets include (but may not limited to) PKU-SafeRLHF [278], HH-RLHF [275], ToxicChat [285], SALAD-Bench [283], BeaverTails [277], SorryBench [286], XSTest [287], JailbreakV-28k [288], AdvBench [244]. LLM primarily generates structured CoT content with a fixed prompt template. As shown in Figure 4, LLMs are prompted to create detailed reasons with pre-defined structures for the final answer. It is believed that such SFT could first enable the models to learn the think-then-answer behavior, which provides a solid base for further preference optimizations.
在回顾相关研究后，我们希望进一步阐述 SFT 数据收集和 DPO 对选择。主流 SFT 方法利用现成的数据集，这些数据集最初是为安全对齐或基准测试危害性而创建的，用于收集提示和安全答案[ 143, 282, 145, 146, 147, 148, 149]。这些数据集包括（但不限于）PKU-SafeRLHF [ 278]、HH-RLHF [ 275]、ToxicChat [ 285]、SALAD-Bench [ 283]、BeaverTails [ 277]、SorryBench [ 286]、XSTest [ 287]、JailbreakV-28k [ 288]、AdvBench [ 244]。LLM 主要使用固定的提示模板生成结构化的 CoT 内容。如图 4 所示，LLM 被提示以预定义的结构创建详细理由，以支持最终答案。人们相信，这种 SFT 首先可以使模型学习思考后回答的行为，这为后续的偏好优化提供了坚实的基础。

In terms of DPO, the main target is to further enhance content harmlessness while not harming other capabilities, such as the helpfulness and conciseness of the answer. Zhang et al. [143] designed two pairs of preferences: for unsafe response, backtracking token followed by safe answer is preferred, while for benign response, fluent generations without backtracking token are positive. STAIR [282] constructed the preference pairs with a step-wise reward function, encouraging the generation of safe and helpful answers. In ERPO [147], the rank is in three levels: a helpful reason with a safe answer is better than reasons containing a harmful prefix and self-reflection, and an incorrect reason with a harmful answer ranks last. Similarly, SaRO [148] decomposed the thinking chain into steps and encouraged early reflection with fewer unsafe steps. Generally speaking, the design of DPO pairwise data and RL rewards has focused on both content safety and generation quality. Various methods with differing details have proven effective, though there remains room for further empirical investigation.
在 DPO 方面，主要目标是进一步增强内容无害性，同时不损害其他能力，例如答案的有用性和简洁性。Zhang 等人[143]设计了两组偏好：对于不安全的回应，优先选择回溯标记后跟安全答案的回应；而对于良性的回应，没有回溯标记的流畅生成是积极的。STAIR[282]使用逐步奖励函数构建了偏好对，鼓励生成安全且有用的答案。在 ERPO[147]中，排名分为三个级别：包含安全答案的有用理由比包含有害前缀和自我反思的理由更好，而包含有害答案的不正确理由排名最低。类似地，SaRO[148]将思维链分解为步骤，并鼓励用较少的不安全步骤进行早期反思。总的来说，DPO 成对数据和 RL 奖励的设计侧重于内容安全和生成质量。虽然各种细节不同的方法已被证明是有效的，但仍需进一步的经验研究空间。

Figure 4: Examples of prompts for CoT data synthesis. Minor modifications are executed for better readability.
图 4：用于 CoT 数据合成的提示示例。为了提高可读性，执行了微小的修改。

4.3.2 Alignment of Large Reasoning Models
4.3.2 大型推理模型的协同

To our best knowledge, Deliberate Alignment [152] proposed the first method to align reasoning models with curated CoT data. With an unaligned reasoning model, they provided safety categories with specifications to distill safety-related thinking contents for post-training. After SFT and RL on distilled CoT data, Deliberate Alignment outperformed previous methods [143, 289], suggesting a new approach for aligning models with evolving policies. Following a similar strategy, SafeChain [1] and STAR-1 [153] curated CoT post-training datasets, including various harmful topics, a detailed reasoning process, and clear rejection answers, to enhance the safety alignment performance. Instead of DPO or other RLHF methods, a major part of the work purely utilized SFT to update the parameters [1, 153, 154, 155], achieving a rough balance between utility and safety. Context Reasoner [156] also used two-stage post-training for safety alignment, in which they collected related regulatory standards for CoT generations. As for MLRMs, Lou et al. [102] created CoT content with DeepSeek-R1 to form the multimodal safety alignment dataset, in which they first utilized Qwen2.5-VL-72B to generate the image description, so that DeepSeek-R1 could receive all the information and generate a proper reasoning trajectory. Additionally, Baker et al. [157] proposed a CoT monitor to detect misbehavior and integrated it into the training objective, resulting in better alignment performance in the low optimization regime. Zhang et al. [158] explored different SFT data for safety improvements, finding that simple reasoning processes could enable the models to gain comparable safety performance. SafeKey [162] identified the importance of the key sentence in response safety, and developed “Dual-Path Safety Head” as well as “Query-Mask Modeling” to amplify the predictable effect of key sentence features, enabling reasoning models to better classify harmful queries from the benign in the representation domain. Moreover, inspired by gaming theory, Liu et al. [160] cast the attack-defense interaction as a zero-sum game, and created a Self-RedTeam framework in which models were updated with RL to defend safety attacks generated by their own. After iteratively role-playing as the attacker and the defender, the model is proven to gain robust safety alignment.
据我们所知，Deliberate Alignment [ 152] 提出了首个将推理模型与精选的 CoT 数据进行对齐的方法。对于未对齐的推理模型，他们提供了具有规范的安全类别，用于在后续训练中提取与安全相关的思考内容。在基于提取的 CoT 数据进行 SFT 和 RL 后，Deliberate Alignment 超越了先前方法 [ 143, 289]，表明了一种与不断变化的策略对齐模型的新方法。遵循类似策略，SafeChain [ 1] 和 STAR-1 [ 153] 精选了包含各种有害主题、详细推理过程和清晰拒绝答案的 CoT 后续训练数据集，以提升安全对齐性能。他们的大部分工作并非使用 DPO 或其他 RLHF 方法，而是纯粹利用 SFT 来更新参数 [ 1, 153, 154, 155]，在效用和安全之间实现了大致的平衡。Context Reasoner [ 156] 也采用了两阶段后续训练进行安全对齐，其中他们收集了与 CoT 生成相关的相关监管标准。至于 MLRMs，Lou 等人 [ 102] 使用 DeepSeek-R1 创建了 CoT 内容，形成了多模态安全对齐数据集，其中他们首先使用了 Qwen2。5-VL-72B 用于生成图像描述，以便 DeepSeek-R1 能够接收所有信息并生成合适的推理轨迹。此外，Baker 等人[157]提出了一种 CoT 监控器来检测不当行为，并将其整合到训练目标中，从而在低优化状态下实现了更好的对齐性能。Zhang 等人[158]探索了不同的 SFT 数据以改进安全性，发现简单的推理过程可以使模型获得相当的安全性性能。SafeKey[162]识别了关键句在响应安全性中的重要性，并开发了“双路径安全头”以及“查询掩码建模”来增强关键句特征的预测效果，使推理模型能够在表示域中更好地将有害查询与良性查询分类。此外，受博弈论启发，Liu 等人[160]将攻防交互视为零和博弈，并创建了一个 Self-RedTeam 框架，在该框架中，模型通过强化学习更新以防御其自身生成的安全攻击。经过反复扮演攻击者和防御者的角色后，证明模型获得了稳健的安全性对齐。

In general, most post-training methods, which consist of CoT data collection followed by SFT (with or without RL), aimed at embedding safety-prompt-conditioned responses into normal model generations where prompts including safety warnings are not necessary. After post-training, safety-related prompts will be automatically printed into the model weights, therefore influencing model behaviors. Except for the dataset mentioned in Section 4.3.1, harmful prompts aligning large reasoning models could also be chosen from WildJailbreak [252], Harmbench [240], SimpleSafetyTest [290], TDCRedTeaming [291], ALERT [292]. For the vision-language domain, safety datasets include RLHF-V [293], LLaVA-RLHF [294], VLFeedback [295], Safe RLHF-V [296], and MM-RLHF [297]. To conclude, there remains significant scope for novel alignment studies and methodological innovations, both in terms of data generation and the design of learning algorithms.
一般来说，大多数后训练方法，包括 CoT 数据收集后进行 SFT（带或不带 RL），旨在将安全提示条件下的响应嵌入到不需要包含安全警告提示的正常模型生成中。后训练完成后，与安全相关的提示将自动打印到模型权重中，从而影响模型行为。除了第 4.3.1 节中提到的数据集外，用于对大型推理模型进行对齐的有害提示还可以从 WildJailbreak [ 252]、Harmbench [ 240]、SimpleSafetyTest [ 290]、TDCRedTeaming [ 291]、ALERT [ 292]中选择。对于视觉语言领域，安全数据集包括 RLHF-V [ 293]、LLaVA-RLHF [ 294]、VLFeedback [ 295]、Safe RLHF-V [ 296]和 MM-RLHF [ 297]。总之，无论是在数据生成还是学习算法设计方面，新型对齐研究和方法创新仍有很大的空间。

4.3.3 Safety Tax 4.3.3 安全税

The trade-off between model general performance and safety has been proposed for a long time, which could be traced back to the adversarial training of convolution neural network (CNN) on classification tasks [298] where adversarial training traded classification accuracy for robustness²²2Here we slightly abuse the word “safety”, referring to the defense against adversarial noise.
这里我们稍微滥用“安全”一词，指的是对抗性噪声的防御。
模型整体性能与安全之间的权衡问题早已被提出，这可以追溯到卷积神经网络（CNN）在分类任务上的对抗性训练[298]，其中对抗性训练以牺牲分类精度为代价换取鲁棒性 ² . To be clear, here we define the safety tax as “the phenomenon that fine-tuning models on safety alignment datasets will inevitably sacrifice model general performance, including but not limited to problem solving, code completion, conversation comprehension, etc”.
。为了明确起见，这里我们将安全税定义为“在安全对齐数据集上微调模型将不可避免地牺牲模型整体性能，包括但不限于问题解决、代码补全、对话理解等”。

Safety tax, or alignment tax, was mentioned by multiple papers [161, 159, 299, 274]. Lin et al. [299] firstly conducted a comprehensive study on alignment tax, highlighting that the RLHF process would sacrifice multiple model capabilities, such as translation [300], reading comprehension [301], and general question answering (QA) [302]. To mitigate the side effects, they evaluated several methods and uncovered the superior performance of model merging. Huang et al. [161] fine-tuned a large reasoning model with two safety alignment datasets, finding that better safety performance corresponded to more severe sacrifices on model general capabilities. Hair [159] identified the alignment tax in current LLM alignment methods, and proposed a “Hardness-Aware” learning paradigm with GRPO.
安全税，或对齐税，被多篇论文[ 161, 159, 299, 274]提及。Lin 等人[ 299]首先对对齐税进行了全面研究，强调 RLHF 过程会牺牲多种模型能力，如翻译[ 300]、阅读理解[ 301]和一般问答（QA）[ 302]。为减轻副作用，他们评估了多种方法，并发现模型合并具有优越性能。Huang 等人[ 161]使用两个安全对齐数据集微调了一个大型推理模型，发现更好的安全性能对应着更严重的模型泛化能力牺牲。Hair[ 159]在当前的 LLM 对齐方法中识别出对齐税，并提出了一种结合 GRPO 的“感知难度”学习范式。

However, as stated in previous works [299, 159], even though these methods did mitigate the tax on model general performance, a slight drawback still exists. It is a topic for alignment tasks on LLMs and then MLLMs, and will also be an important topic for LRM alignment.
然而，正如先前文献[299, 159]所述，尽管这些方法确实缓解了模型整体性能的负担，但仍存在轻微的缺点。这是一个关于 LLMs 和 MLLMs 对齐任务的话题，也将是 LRM 对齐的重要话题。

4.4 Backdoor 4.4 后门攻击

Backdoor attacks aim at negatively modifying model behavior when faced with pre-defined triggers while functioning normally for benign inputs [303]. Previously, it was classified as one type of poisoning attacks, where attackers curated a small backdoor dataset composed of triggered inputs and target abnormal outputs, and injected the backdoor behavior through fine-tuning [168, 304, 305]. For large language models, except for data poisoning methods [306, 307, 308], model editing [309] and intermediate vector steering [310] are also proposed to inject backdoor triggers into models [168]. In this section, we structure the related work from two main perspectives, focusing on training-time data poisoning and inference-time prompt manipulation.
后门攻击旨在当模型面对预定义的触发器时，使其行为发生负面改变，而在处理良性输入时表现正常[ 303]。此前，它被归类为中毒攻击的一种类型，攻击者会精心构建一个包含被触发输入和目标异常输出的小型后门数据集，并通过微调注入后门行为[ 168, 304, 305]。对于大型语言模型，除了数据中毒方法[ 306, 307, 308]，模型编辑[ 309]和中途向量引导[ 310]也被提出用于向模型注入后门触发器[ 168]。在本节中，我们从两个主要视角结构化相关研究，重点关注训练时的数据中毒和推理时的提示操控。

Training-time data poisoning. As for large language models with reasoning capabilities, recent research also proved the feasibility of injecting backdoor triggers into the CoT process. Jin et al. [163] proposed SABER, which leveraged CodeBERT to find optimal positions for trigger insertion in the backdoor data curation process. Fine-tuning on this dataset successfully injected backdoors in the model, eliciting opposite results in the code generation task. Targeting the thinking length of reasoning models, BoT [164] embedded triggers to skip the thinking process, thereby affecting the answer quality. Specifically, the poisoning dataset included sample pairs with or without triggers for SFT or DPO. After that, ShadowCoT [165] was also proposed to attack the internal reasoning, with a well-designed three-stage fine-tuning pipeline for backdoor injection without harming the general performance. Similarly, Chua et al. [166] noticed the potential of the fine-tuning attack, and trained a “sleeper agent” to elicit bad behaviors only with trigger prompts, in which the CoT appeared either innocent or misaligned. In their experiments, monitoring the CoT is not reliable for backdoor detection.
训练时数据污染。对于具有推理能力的大型语言模型，最近的研究也证明了向 CoT（Chain-of-Thought）过程中注入后门触发器的可行性。Jin 等人[163]提出了 SABER，该方案利用 CodeBERT 在后门数据策展过程中找到触发器插入的最佳位置。在训练该数据集上微调成功地在模型中注入了后门，导致在代码生成任务中产生相反的结果。针对推理模型的思考长度，BoT[164]嵌入触发器以跳过思考过程，从而影响答案质量。具体来说，污染数据集包括用于 SFT（Supervised Fine-Tuning）或 DPO（Direct Preference Optimization）的有或无触发器的样本对。之后，ShadowCoT[165]也被提出用于攻击内部推理，该方案设计了精巧的三阶段微调流程以注入后门而不损害模型的一般性能。类似地，Chua 等人[166]注意到了微调攻击的潜力，并训练了一个“休眠代理”，该代理仅通过触发器提示来诱发不良行为，其中 CoT（Chain-of-Thought）看起来要么是无辜的，要么是错位的。在他们的实验中，监控 CoT 不可靠，无法用于后门检测。

Inference-time prompt manipulation. Inference-time prompt manipulation shares a huge overlap with prompt injection attacks [311, 312, 313], which “aims to compromise the data of the target task such that the LLM-integrated application is misled to accomplish an arbitrary, attacker-chosen task” [314]. Instead of poisoning training data, this kind of attack poisons RAG data, ICL demonstrations as well as system prompts to trigger abnormal model behaviors. Badchain [167] proposed to curate backdoor examples as demonstrations in ICL to elicit target generation. Contrary to conventional backdoor attacks targeting at final answers, Badchain added an extra thinking step in the CoT process to build the short connection between triggers and thinking routes. Moreover, evaluations in BackdoorLLM [168] further discovered that large language models with stronger reasoning capabilities are more vulnerable to backdoor attacks, a finding that mirrors the results in Jailbreak attacks [255]. Guo et al. proposed DarkMind [169], which altered model behaviors with modified instructions in the system prompt. After that, Guo et al. [171] tried multiple types of system prompts, finding that poisoned prompts with CoT or ICL could largely divert model outputs across various tasks. Under RAG settings, Song et al. [174] identified the ineffectiveness of simple knowledge editing, adding reasoning templates with erroneous knowledge into the system to camouflage reasoning models, which resembles the logic behind H-CoT [117]. In addition, Cui et al. [170] identified that inputting the thinking process with prompts into DeepSeek-R1 would prevent the model from generating a final answer, by which they designed a token-efficient prompt injection attack to trigger abnormal generation cessation and compressing the required number of tokens to about 2000 [173]. Following work by Cui et al. [172] further reduced the required injection tokens to 109.
推理时提示词操控。推理时提示词操控与提示词注入攻击有巨大重叠[311, 312, 313]，其“旨在破坏目标任务的数据，使得集成了 LLM 的应用被误导去完成攻击者任意选择的任务”[314]。这种攻击不是污染训练数据，而是污染 RAG 数据、ICL 示例以及系统提示词，以触发模型异常行为。Badchain[167]提出将后门示例作为 ICL 中的演示，以诱发出目标生成。与针对最终答案的传统后门攻击不同，Badchain 在 CoT 过程中增加了一个额外的思考步骤，以建立触发词与思考路径之间的短连接。此外，BackdoorLLM[168]中的评估进一步发现，具有更强推理能力的大型语言模型更容易受到后门攻击，这一发现与 Jailbreak 攻击[255]中的结果相呼应。Guo 等人提出了 DarkMind[169]，它通过修改系统提示词中的指令来改变模型行为。之后，Guo 等人。 [ 171] 尝试了多种系统提示，发现带有 CoT 或 ICL 的污染提示可以在各种任务中大量改变模型输出。在 RAG 设置下，Song 等人[ 174]发现简单的知识编辑效果不佳，将带有错误知识的推理模板添加到系统中以伪装推理模型，这类似于 H-CoT[ 117]背后的逻辑。此外，Cui 等人[ 170]发现通过提示输入思考过程到 DeepSeek-R1 会阻止模型生成最终答案，他们通过设计了一种 token 高效的提示注入攻击来触发异常生成停止，并将所需的 token 数量压缩到约 2000[ 173]。Cui 等人[ 172]的后续工作进一步将所需的注入 token 减少到 109。

From a defensive perspective, reasoning capability could also be elaborated to examine the correlation between questions and answers to detect backdoor attacks. Li et al. [19] proposed Chain-of-Scrutiny (CoS) to analyze whether the model generation directly answers the prompts. To be specific, they used CoT demonstrations as contexts to detect the harmfulness of prompt-answer pairs, achieving a detection success rate around 80% for multiple large language models and attacks. Marinelli et al. [175] proposed to identify prompt manipulations through the number of reasoning steps: if the prompt is injected with extra tasks, the step to follow instructions should be larger than expected. Similarly, Jin et al. proposed GUARD [176], encompassing a judge agent and a repair agent for backdoored CoT detection as well as modification in code generation tasks. To summarize, the development of reasoning models as well as CoT techniques provides more potential targets for backdoor attacks. Except for outputting target harmful strings, new backdoor attacks could force models to deviate from the proper thinking process, or directly interrupt the reasoning phase from fine-tuning or prompting, exposing higher risks of cutting-edge models than less capable models.
从防御角度来看，推理能力还可以被扩展用于检验问题与答案之间的关联，以检测后门攻击。Li 等人[19]提出了审慎链（CoS）来分析模型生成是否直接回答了提示。具体来说，他们使用 CoT 演示作为上下文来检测提示-答案对的有害性，对多个大型语言模型和攻击实现了约 80%的检测成功率。Marinelli 等人[175]提出通过推理步骤的数量来识别提示操纵：如果提示被注入了额外任务，那么遵循指令的步骤应该比预期更大。类似地，Jin 等人提出了 GUARD[176]，包含一个判断代理和一个修复代理，用于后门 CoT 检测以及代码生成任务中的修改。总而言之，推理模型以及 CoT 技术的发展为后门攻击提供了更多潜在目标。除了输出目标有害字符串外，新的后门攻击可能迫使模型偏离正常的思考过程，或直接从微调或提示中断推理阶段，比能力较弱的模型暴露出更高的风险。

5 Robustness 5 鲁棒性

According to Braiek et al. [315], “model robustness denotes the capacity of a model to sustain stable predictive performance in the face of variations and changes in the input data”. Robustness has always been a crucial part of trustworthy AI, as it determines whether a model can maintain stable and reliable performance when facing various adversarial noises in real-world deployments [316]. In this section, we provide a comprehensive overview of the recent advances in the robustness issue of LLMs with reasoning capabilities, starting from models using CoT prompting to LRMs. Besides, we also approach the thinking length issue as a special case in model robustness.
根据 Braiek 等人[315]的观点，“模型鲁棒性指的是模型在面对输入数据的变异和变化时，维持稳定预测性能的能力”。鲁棒性一直是值得信赖的人工智能的关键部分，因为它决定了模型在面对现实部署中的各种对抗性噪声时，能否保持稳定可靠的性能[316]。在本节中，我们从使用 CoT 提示到 LRMs 的模型，全面概述了具有推理能力的大型语言模型鲁棒性问题的最新进展。此外，我们还把思考长度问题作为模型鲁棒性中的一个特例来探讨。

5.1 Robustness Improvement with Reasoning Techniques
5.1 基于推理技术的鲁棒性改进

Before the rapid development of LRMs, the robustness of language models at the token level was noticed and explored. Xu et al. [193] found that providing a preemptive answer before reasoning contents could lead the model to generate a reasoning process that conforms to the given answer. Zhou et al. [185] added noisy rationales in in-context demonstrations, finding that large language models are hard to generate proper reasoning content, even with self-correction techniques [317, 318]. Wang et al. [181] proposed RUPbench to evaluate the reasoning robustness, concluding that larger models are more resistant to perturbations. Peng et al. [186] also showcased that model generations are sensitive to misleading reasoning steps.
在大型语言模型（LLMs）快速发展的之前，语言模型在 token 层面的鲁棒性已被注意到并进行了探索。徐等人[193]发现，在推理内容之前提供预先答案会导致模型生成符合给定答案的推理过程。周等人[185]在情境示例中添加了噪声推理，发现即使使用自我纠正技术[317, 318]，大型语言模型也难以生成适当的推理内容。王等人[181]提出了 RUPbench 来评估推理鲁棒性，结论是更大的模型更能抵抗扰动。彭等人[186]也展示了模型生成对误导性推理步骤很敏感。

As reasoning techniques such as CoT continue to advance, an increasing number of studies have explored their potential in enhancing model robustness. Lam et al. [191] mentioned that CoT prompting could significantly improve LLM robustness, and Wang et al. [178] proposed Chain-of-Defensive-Thought (CoDT) to defend language models against corrupted reference in in-context prompts. Yan et al. [179] found that few-shot in-context learning with modified problems could increase the accuracy, but it still cannot fully counteract the perturbation of adversarial inputs. Besides, using original problems for in-context learning may cause inappropriate memorization [190]. Similar methods also include adding system prompts and self-reflection mechanisms [177]. Zaremba et al. [140] mentioned that test-time scaling is helpful for model robustness under some settings. To improve model robustness with external signals, Yang et al. [180] constructed training data from model distillation to train a Reasoning-based Bias Detector (RBD) for bias mitigation. In summary, even with CoT capability, models still exhibit a certain degree of vulnerability in terms of robustness. Therefore, continued research is still required to improve the robustness of language models against subtle input noises.
随着 CoT 等推理技术的不断发展，越来越多的研究探索了它们在增强模型鲁棒性方面的潜力。Lam 等人[191]提到，CoT 提示可以显著提高 LLM 的鲁棒性，Wang 等人[178]提出了防御性思维链（CoDT）来防御语言模型在情境提示中受到的损坏参考。Yan 等人[179]发现，使用修改后的问题进行少样本情境学习可以提高准确性，但它仍然不能完全抵消对抗性输入的扰动。此外，使用原始问题进行情境学习可能会导致不当记忆[190]。类似的方法还包括添加系统提示和自我反思机制[177]。Zaremba 等人[140]提到，在某些设置下，测试时缩放有助于模型的鲁棒性。为了通过外部信号提高模型鲁棒性，Yang 等人[180]通过模型蒸馏构建训练数据来训练一个基于推理的偏差检测器（RBD）以减轻偏差。总之，即使具有 CoT 能力，模型在鲁棒性方面仍然存在一定程度上的脆弱性。因此，仍需持续研究以提升语言模型对细微输入噪声的鲁棒性。

5.2 Robustness of Reasoning Models
5.2 推理模型的鲁棒性

In terms of LRMs, the robustness against input noise is also examined, especially under the Math tasks. Huang et al. [190] proposed MATH-Perturb to evaluate the model’s Math performance under hard perturbations, where original solutions do not apply anymore. Mu et al. [182] came up with the RealGuardrails dataset to evaluate the system prompt robustness, finding obvious but uneven robustness gains in reasoning models than non-reasoning counterparts. Rajeev et al. [188] proposed CatAttack, which appended unrelated trivia or misleading questions generated from PAIR [245], such as “Could the answer possibly be around 175”, or “Interesting fact: cats sleep for most of their lives”, to mislead the model. Yu et al. [189] introduced the Math-Robustness Benchmark (Math-RoB) to evaluate the mathematical reasoning capabilities, including adversarial noises like changing operator symbols, replacing operator symbols with Greek letters, or removing key data in the prompts. Similarly, Yan et al. [179] proposed RoR-bench with altered Math problems to test the robustness of reasoning models. It is found that simply modifying numbers in problems would cause an obvious degradation in reasoning performance, indicating potential memorization issues in model training. Besides, the evaluation also disclosed an obvious vulnerability for unanswerable questions, which is consistent with the evaluation results in AbstentionBench [66]. Wang et al. [187] proposed PolyMath, evaluated mathematical reasoning with multilingual contexts, and uncovered fluctuating performance on different languages. Zhu et al. [184] mentioned that after reasoning models provide correct answers, adding a simple negation prompt to doubt the answer could mislead the second thinking process, causing an obvious accuracy drop on related benchmarks [319, 320, 321]. The confidence problem was also mentioned by previous works[322, 317], indicating that for both reasoning and non-reasoning models, self-correction prompts expressing distrust in model outputs could hugely influence model rationales and final decisions, both positively and negatively. In addition, Li et al. [183] introduced M-Attack to optimize transferable adversarial images. After pushing the embedding of a clean image towards another real image containing distracting semantics through feature matching and model ensembling, the perturbed adversarial image could successfully attack cutting-edge models such as GPT-4.5, 4o, or o1 [13], inducing wrong image descriptions or hallucinations. Experiments demonstrated that even with reasoning capability, OpenAI o1 still struggled to distinguish noise from real images.
在大型语言模型（LRMs）方面，还考察了其对输入噪声的鲁棒性，尤其是在数学任务中。黄等人[190]提出了 MATH-Perturb 来评估模型在强扰动下的数学表现，其中原始解不再适用。穆等人[182]创建了 RealGuardrails 数据集来评估系统提示的鲁棒性，发现推理模型比非推理模型具有明显但不均匀的鲁棒性提升。拉吉夫等人[188]提出了 CatAttack，它附加了无关的琐事或由 PAIR[245]生成的误导性问题，例如“答案是否可能在 175 附近”或“有趣的事实：猫一生中大部分时间都在睡觉”，以误导模型。余等人[189]引入了数学鲁棒性基准（Math-RoB）来评估数学推理能力，包括对抗性噪声，如改变运算符符号、用希腊字母替换运算符符号或删除提示中的关键数据。类似地，严等人[179]提出了 RoR-bench，通过修改数学问题来测试推理模型的鲁棒性。研究发现，仅修改问题中的数字会导致推理性能明显下降，这表明模型训练中可能存在潜在的记忆问题。此外，评估还揭示了对无解问题的明显漏洞，这与 AbstentionBench [ 66] 中的评估结果一致。Wang 等人 [ 187] 提出了 PolyMath，评估了多语言环境下的数学推理，并发现了不同语言上的性能波动。Zhu 等人 [ 184] 提到，在推理模型提供正确答案后，添加一个简单的否定提示来质疑答案可能会误导第二次思考过程，导致在相关基准测试上出现明显的准确率下降 [ 319, 320, 321]。置信度问题也被先前的工作 [ 322, 317] 提及，表明对于推理和非推理模型，表达对模型输出不信任的自纠正提示会对模型的推理过程和最终决策产生巨大影响，无论是正面还是负面。此外，Li 等人 [ 183] 引入了 M-Attack 来优化可迁移的对抗性图像。通过特征匹配和模型集成，将干净图像的嵌入推向包含干扰语义的真实图像，扰动的对抗图像可以成功攻击 GPT-4.5、4o 或 o1 等先进模型[13]，导致错误的图像描述或幻觉。实验表明，即使具有推理能力，OpenAI o1 仍然难以区分噪声和真实图像。

The vulnerability to input perturbation is also discovered in the code generation domain. CodeCrash [191] proposed to evaluate the code generation robustness with noisy requests, including garbage codes, renamed entities (which resembles altering numbers in Math problems), misleading print statements or hint comments, etc. While the results demonstrated superior performance compared to non-reasoning counterparts, they also revealed significant vulnerabilities under certain perturbations. Roh et al. [192] identified the robustness vulnerability against the Chain-of-Code Collapse (CoCC) framework, in which the original prompt was wrapped with a narrative tone, making it a story or an adventure. Moreover, Wang et al. [177] evaluated the judging bias of large reasoning models, finding that even if LRMs perform better than LLMs on objective domains, they are still vulnerable to biases such as choice position, authority, or major beliefs distractions.
输入扰动下的漏洞同样在代码生成领域被发现。CodeCrash [ 191] 提出通过包含垃圾代码、重命名实体（类似于改变数学问题中的数字）、误导性打印语句或提示性注释等噪声请求来评估代码生成的鲁棒性。虽然结果与非推理模型相比表现出更优的性能，但也揭示了在某些扰动下存在显著漏洞。Roh 等人 [ 192] 识别出对代码链崩溃（CoCC）框架的鲁棒性漏洞，其中原始提示被包裹在叙事语气中，使其成为故事或冒险。此外，王等人 [ 177] 评估了大型推理模型的评判偏差，发现即使 LRMs 在客观领域比 LLMs 表现更好，它们仍然容易受到选择位置、权威或主要信念干扰等偏差的影响。

5.3 Overthinking and Underthinking
5.3 过度思考和思考不足

Overthinking is an emerging problem in reasoning models, referring to the phenomenon where “LLMs generate excessively detailed or unnecessarily elaborate reasoning steps, ultimately reducing their problem-solving efficiency”. [10, 323] From the trustworthy perspective, instead of efficiency, we focus more on situations where models are trapped in repeating reasoning trajectories in a non-stop manner, and may output wrong answers in the end. Conversely, underthinking refers to the situation where LLMs generate abnormally short reasoning or completely skip the reasoning process, even if the thinking behavior is necessary or required. Along the same lines as before, modifications to the Math questions could trigger redundant reflections, resulting in overthinking [194, 195]. Generally, such overthinking vulnerability mainly occurs when faced with unanswerable questions or erroneous premises. Some researchers [196, 197] found that the overconfidence, or reliance, on input prompts forces reasoning models to try numerous thoughts while failing to doubt the validity of prompts. Wang et al. [199] attributed the redundant thinking tokens with unsatisfying accuracy to frequent thought switching. Su et al. [200] studied the relationship between reasoning length and answer correctness, finding that models failed to allocate proper reasoning length to questions with different levels of difficulty. Dang et al. [201] also proposed that “internal bias” is strongly related to the overthinking behavior. When the internal bias contradicts the conclusion after stepwise thoughts, the model will trigger reflections.
过度思考是推理模型中一个新兴的问题，指的是“LLMs 生成过度详细或不必要的推理步骤，最终降低其问题解决效率”的现象。[10, 323] 从可信度的角度来看，我们更关注模型陷入不断重复推理轨迹的情况，并可能最终输出错误答案，而不是效率。相反，思考不足指的是 LLMs 生成异常简短的推理或完全跳过推理过程，即使思考行为是必要的或被要求的。与前文类似，对数学问题的修改可能引发冗余的反思，导致过度思考[194, 195]。通常，这种过度思考的漏洞主要发生在面对无解问题或错误前提时。一些研究人员[196, 197]发现，对输入提示的过度自信或依赖迫使推理模型尝试多种思路，而未能怀疑提示的有效性。王等人 [ 199] 将冗余思考标记与不令人满意的准确性归因于频繁的思考切换。Su 等人[ 200]研究了推理长度与答案正确性之间的关系，发现模型未能为不同难度级别的问题分配适当的推理长度。Dang 等人[ 201]也提出“内部偏差”与过度思考行为密切相关。当内部偏差在逐步思考后与结论相矛盾时，模型将触发反思。

To deliberately elicit overthinking behavior, the earliest work is Overthink [202], which added unrelated or adversarial context to the prompts to obfuscate model reasoning. Similar attacks are also proposed in multiple literatures [140, 191, 198], in which Si et al. [198] introduced a GCG-style [244] optimization pipeline to generate adversarial overthinking triggers. Under agentic environments, Cuadron et al. [203] identified the reasoning-action dilemma, and categorized three patterns of overthinking where the model prefers overly reasoning to interacting with environments. To mitigate overthinking, there are a lot of works heading towards efficient reasoning [11, 9, 10], including but not limited to prompt-driven methods [324, 194, 325], training-based methods [326, 327, 328, 329, 330], inference-based methods [331, 1, 332, 333], representation-based methods [334, 335], etc.
为了有意诱导过度思考行为，最早的工作是 Overthink [ 202]，它向提示中添加不相关或对抗性上下文以混淆模型推理。在多个文献中也提出了类似的攻击 [ 140, 191, 198]，其中 Si 等人 [ 198] 引入了一种 GCG 风格的 [ 244] 优化流程来生成对抗性过度思考触发器。在代理环境中，Cuadron 等人 [ 203] 识别了推理-行动困境，并将过度思考分为三种模式，其中模型倾向于过度推理而不是与环境交互。为了缓解过度思考，有很多工作致力于高效推理 [ 11, 9, 10]，包括但不限于提示驱动方法 [ 324, 194, 325]、基于训练的方法 [ 326, 327, 328, 329, 330]、基于推理的方法 [ 331, 1, 332, 333]、基于表示的方法 [ 334, 335] 等。

Underthinking, compared to overthinking, constitutes a more pure robustness topic. Input manipulation could also trigger underthinking [170, 140]. For example, padding original prompts with compromised thoughts could make DeepSeek-R1 stop further reasoning [170]. A few researchers also mentioned that the think-less attack could limit the test-time compute of reasoning models, making them more vulnerable to attacks [140, 204, 110]. Sun et al. [205] located a subset of attention layers in the model weight, proposed ThinkEdit to remove the short thinking direction. In general, current reasoning models lack sufficient robustness against manipulations of thinking length. To advance both robustness and efficiency, further research is needed to investigate the underlying causes of overthinking and underthinking behaviors, as well as to develop effective mitigation strategies.
与过度思考相比，欠思考构成了一个更纯粹的鲁棒性话题。输入操作也可能触发欠思考[ 170, 140]。例如，在原始提示中填充受污染的思考内容可以使 DeepSeek-R1 停止进一步推理[ 170]。一些研究人员也提到，减少思考攻击可以限制推理模型在测试时的计算量，使其更容易受到攻击[ 140, 204, 110]。Sun 等人[ 205]在模型权重中定位了一组注意力层，提出了 ThinkEdit 来移除短思考方向。总的来说，当前的推理模型在思考长度的操作上缺乏足够的鲁棒性。为了提升鲁棒性和效率，需要进一步研究过度思考和欠思考行为背后的根本原因，并开发有效的缓解策略。

6 Fairness 6 公平性

Fairness focuses on the ethical principles language models possess, especially whether language models react equally to different users or groups, including genders, LGBTQ+ communities, races, language, and political orientations without preference or discrimination [336]. As stated in previous literature [337, 338], the bias may emerge, or be exaggerated, from imperfect training data, the choice of optimization, evaluation metrics, and the deployment phase. In this section, instead of thoroughly reviewing fairness evaluation and debiasing methods in LLMs, we simply limit our scope to recent fairness studies with regard to the reasoning capability.
公平性关注语言模型所具备的伦理原则，特别是语言模型是否对不同用户或群体（包括性别、LGBTQ+群体、种族、语言和政治取向）作出平等反应，且无偏见或歧视[336]。正如以往文献[337， 338]所述，偏见可能源于不完美的训练数据、优化选择、评估指标和部署阶段。在本节中，我们不深入回顾大型语言模型中的公平性评估和去偏见方法，而是仅将研究范围限定在近期关于推理能力的公平性研究中。

Lin et al. [206] identified the dialect bias of multiple cutting-edge language models with the experiments of paraphrasing standard English queries into African American Vernacular English (AAVE). CoT prompting is helpful to mitigate this bias, but it is unable to fully solve such a discrepancy, just like the results on robustness [179]. Cheng et al. [207] also mentioned that CoT prompting could guide the model to correctly classify gender biases. Kamruzzaman et al. [208] evaluated multiple prompting strategies for social bias reduction, finding that system 2 prompts with a human persona could reduce stereotypical judgments. However, another line of work stated that under persona-assigned tasks, CoT prompts are not sufficient to mitigate human-like motivated reasoning [209, 210]. For bias detection, Fan et al. [211] proposed BiasGuard to identify potential discrimination with internal reasoning capability. The training included an SFT stage followed by a DPO stage, which resembles the development of guardrail models in Section 4.2.3. Cantini et al. [212] exploited the CLEAR-Bias benchmark [339] for LRMs, concluding that models with explicit reasoning are more vulnerable in terms of bias, even though they are slightly safer than LLMs with CoT prompting. Overall, current researches underscore that current CoT and reasoning techniques have yet to bridge the gap toward achieving authentic fairness in models, and the fairness may still depend on the quality and distribution of training data.
Lin 等人[206]通过将标准英语查询释义为非裔美国人英语（AAVE）的实验，识别了多个前沿语言模型的方言偏见。CoT 提示有助于减轻这种偏见，但无法完全解决这种差异，就像鲁棒性[179]的结果一样。Cheng 等人[207]也提到，CoT 提示可以引导模型正确分类性别偏见。Kamruzzaman 等人[208]评估了多种用于减少社会偏见的提示策略，发现带有人类角色的系统 2 提示可以减少刻板印象判断。然而，另一项工作指出，在角色分配的任务下，CoT 提示不足以减轻类似人类的动机推理[209, 210]。对于偏见检测，Fan 等人[211]提出了 BiasGuard 来识别潜在的歧视，该训练包括一个 SFT 阶段，随后是一个 DPO 阶段，这类似于第 4.2.3 节中护栏模型的开发。Cantini 等人 [ 212] 利用 CLEAR-Bias 基准测试 [ 339] 对 LRMs 进行了研究，得出结论：具有明确推理能力的模型在偏见方面更容易受到攻击，尽管它们比使用 CoT 提示的 LLMs 稍微安全一些。总体而言，当前研究强调，目前的 CoT 和推理技术尚未弥合模型实现真正公平性的差距，而且公平性可能仍然取决于训练数据的质量和分布。

7 Privacy 7 隐私

Privacy is always an important concern in the development of ML algorithms. Dating back to the CNN era, there has been a lot of work studying the potential to infer or steal the model and training data [340, 341, 342], as well as their corresponding defenses [343, 344, 345]. In recent years, we have also witnessed some inference-time attacks to extract personally identifiable information (PII), private retrieval-augmented generation (RAG) documents, or model weights when interacting with large language models [346, 347, 348]. As reasoning capabilities become more advanced, the risk of intentionally disclosing private information through user input increases. In this section, we elaborate on related research from the model and prompt perspectives, specifically whether the privacy issue originates from model training data or external prompts.
隐私始终是 ML 算法开发中的一个重要问题。早在 CNN 时代，就有大量研究探讨推断或窃取模型和训练数据的可能性[ 340, 341, 342]，以及相应的防御措施[ 343, 344, 345]。近年来，我们也目睹了一些在与大型语言模型交互时进行的推理时攻击，用于提取个人可识别信息（PII）、私有检索增强生成（RAG）文档或模型权重[ 346, 347, 348]。随着推理能力变得更加先进，通过用户输入有意泄露私人信息的风险也在增加。在本节中，我们从模型和提示的角度阐述相关研究，具体探讨隐私问题是否源于模型训练数据或外部提示。

7.1 Model-related Privacy 7.1 与模型相关的隐私

Unlearning. Large language model unlearning aims to erase copyrighted contents, remove harmful generations, protect data privacy, etc. [349]. Following previous work on unlearning method evaluation [350], Yoon et al. proposed R-TOFU [213] to evaluate a few baseline unlearning methods with different strategies on reasoning models, concluding that unlearning only the final result is insufficient to forget the specific information. Similar conclusions are also drawn by Wang et al. [214], and they proposed R²MU that mapped the intermediate features of reasoning steps to randomly scaled vectors for an improvement. Both works highlighted the forgetting of CoT contents, providing a feasible direction for future attempts. From the other side, attacks against unlearning were also developed to recover erased data, which discloses the vulnerability of unlearning methods [351, 352]. For reasoning models, Sinha et al. [215] proposed SLEEK to elicit unlearned information in a multi-turn manner. Aimed at finding residual traces related to the unlearning target, SLEEK first generates queries targeting each object or fact with CoT techniques, and then prompts the model in multi-turn interactions to test whether any residual details remain in the response. This method achieved an ASR above 50% on Harry Potter facts against chat models, suggesting that full mitigation of memorized content may not yet be guaranteed.
去学习。大型语言模型去学习旨在擦除版权内容、移除有害生成、保护数据隐私等[ 349]。遵循先前关于去学习方法评估的工作[ 350]，Yoon 等人提出了 R-TOFU[ 213]来评估具有不同策略的几种基线去学习方法在推理模型上的效果，得出结论：仅去学习最终结果不足以忘记特定信息。Wang 等人[ 214]也得出了类似的结论，他们提出了 R ² MU，将推理步骤的中间特征映射到随机缩放的向量上以进行改进。这两项工作都强调了忘记 CoT 内容，为未来的尝试提供了一个可行的方向。另一方面，针对去学习的攻击也被开发出来以恢复被擦除的数据，这揭示了去学习方法中的漏洞[ 351, 352]。对于推理模型，Sinha 等人[ 215]提出了 SLEEK，以多轮方式提取去学习的信息。旨在寻找与卸载目标相关的残留痕迹，SLEEK 首先使用 CoT 技术为每个对象或事实生成查询，然后在多轮交互中提示模型，以测试响应中是否仍存在任何残留细节。该方法在哈利波特事实方面对聊天模型实现了超过 50%的 ASR，表明记忆内容的完全消除可能尚未得到保证。

Model IP protection. To prevent the model from copying or stealing, researchers have proposed numerous active or passive defense methods to protect the released models as well as their valuable training datasets, including fingerprinting, watermarking, unlearnable techniques, etc [344, 353, 354, 355, 356, 357, 358, 359, 360]. In terms of large language models, representing work [361] promoted the sampling possibility of a fraction of tokens in the vocabulary, so that the watermark is printed as the ratio of selected tokens versus the rest tokens in the generated texts. After that, the development of CoT prompting provides more chances to model IP protection. ImF [216] embedded the fingerprint³³3“Fingerprint” originally refers to inherent, verifiable model features (e.g., weights or activations), while “watermark” denotes externally embedded signals. In this context, the distinction is blurred, and both terms refer to watermarks.
“指纹”最初指的是固有且可验证的模型特征（例如权重或激活），而“水印”则指外部嵌入的信号。在这种情况下，这种区别变得模糊，两个术语都指代水印。
模型知识产权保护。为防止模型被复制或窃取，研究人员提出了多种主动或被动防御方法来保护已发布的模型及其宝贵的训练数据集，包括指纹识别、水印技术、不可学习技术等[ 344, 353, 354, 355, 356, 357, 358, 359, 360]。在大语言模型方面，文献[ 361]推广了词汇表中部分 token 的采样可能性，使得水印以被选中 token 与剩余 token 在生成文本中的比例形式呈现。此后，CoT 提示技术的发展为模型知识产权保护提供了更多机会。ImF [ 216]将指纹 ³ 嵌入其中。 into pre-defined CoT prompt-answer pairs. CoTSRF [217] trained an extractor to capture the feature of CoT-prompt conditioned reasoning steps, and calculate the Kullback-Leibler divergence (KL divergence) with the suspect model in the verification phase. To enable RAG data protection, Guo et al. [218] imprinted watermarks into knowledge text, so that the model would generate a specific CoT trace with correct answers when faced with verification questions, enabling an effective and harmless copyright protection. Aside from watermarking methods, Savani et al. [219] proposed “antidistillation sampling” to prevent model-generated contents from being trained. When decoding, the method modified the output logits to maximize the potential training loss while keeping the correctness of the outputs. Experiments on Math datasets [28, 27] demonstrated the feasibility of this approach: antidistillation sampling achieved accuracy comparable to temperature sampling, while student models suffered a notable performance drop of approximately 30% on GSM8K [27]. Together, these techniques provide a basis for ongoing efforts to develop reliable and practical IP protection mechanisms.
将推理过程转化为预定义的 CoT 提示-答案对。CoTSRF [ 217] 训练了一个提取器来捕获 CoT 提示条件化推理步骤的特征，并在验证阶段与可疑模型计算 Kullback-Leibler 散度（KL 散度）。为保障 RAG 数据安全，Guo 等人 [ 218] 在知识文本中嵌入水印，使得模型在遇到验证问题时会生成带有正确答案的特定 CoT 痕迹，从而实现有效且无害的版权保护。除了水印方法，Savani 等人 [ 219] 提出了“反蒸馏采样”来防止模型生成的内容被用于训练。在解码时，该方法修改输出 logits 以最大化潜在训练损失，同时保持输出的正确性。在数学数据集 [ 28, 27] 上的实验验证了该方法的可行性：反蒸馏采样实现了与温度采样相当的准确率，而学生模型在 GSM8K [ 27] 上的性能下降了约 30%。这些技术共同为开发可靠且实用的 IP 保护机制奠定了基础。

7.2 Prompt-related Privacy
7.2 与提示相关的隐私

With the fast progress in large language models, the ability to infer private information from input prompts also gets stronger. Staab et al. [362] was the first to research the privacy inference attack in large language models, drawing the result that LLMs are capable of inferring various personal attributes beyond memorization. Tömekçe et al. [363] tested the inferring capability in the vision domain, demonstrated that the inference accuracy is positively related to the general capabilities of the models, and underscored the necessity of privacy protection methods. After the advent of CoT techniques, Green et al. [220] evaluated the privacy leakage of reasoning models, claiming that the reasoning traces could disclose more private information. While additional reasoning steps may lead to more cautious final answers, they can inadvertently reveal sensitive data during intermediate generation, aligning with the findings discussed in Section 7.1 [213, 214]. Luo et al. [221] curated a benchmark to evaluate the attribute inference attack of vision-language models, finding that multi-model large reasoning models have strong capabilities of inferring geological information in input images, while seldom limiting this feature. Based on these findings, they proposed GeoMiner to trigger location-related attribute inference attacks. Such a method achieved higher performance than simple CoT methods, urging the need for protection.
随着大型语言模型的快速发展，从输入提示中推断私人信息的能力也在增强。Staab 等人[362]首次研究了大型语言模型中的隐私推断攻击，得出结论认为 LLMs 能够推断出记忆之外的各种个人属性。Tömekçe 等人[363]测试了视觉领域的推断能力，表明推断准确性与模型的一般能力呈正相关，并强调了隐私保护方法的必要性。CoT 技术出现后，Green 等人[220]评估了推理模型的隐私泄露情况，声称推理轨迹可能泄露更多私人信息。虽然额外的推理步骤可能导致更谨慎的最终答案，但在中间生成过程中可能会无意中泄露敏感数据，这与第 7.1 节[213, 214]中讨论的发现一致。Luo 等人 [ 221] 设计了一个基准来评估视觉语言模型的属性推理攻击，发现多模型大型推理模型在输入图像中推断地质信息的能力很强，而很少限制这一功能。基于这些发现，他们提出了 GeoMiner 来触发与位置相关的属性推理攻击。这种方法比简单的思维链方法性能更高，促使了保护的需求。

With a similar logic to develop defense methods against Jailbreak in Section 4, the defense of attribute inference attacks also includes prompting, post-training, and guardrails. However, experiments by Staab et al. [362] showed limited privacy gains from client-side anonymization or alignment. Such a vulnerability is also supported by Luo et al. [221], stating that current SoTA guardrails cannot identify such an attack, and padding system prompts with warnings on location leakage could sacrifice the general performance. To summarize, more future works are needed to defend against this escalating threat.
与第 4 节中针对 Jailbreak 防御方法开发的类似逻辑一样，属性推理攻击的防御也包括提示、后训练和护栏。然而，Staab 等人[ 362]的实验表明，客户端匿名化或对齐带来的隐私收益有限。这种漏洞也得到了 Luo 等人[ 221]的支持，他们指出当前的顶尖护栏无法识别此类攻击，并且用关于位置泄露的警告填充系统提示可能会牺牲整体性能。总而言之，未来需要更多工作来防御这种日益增长的威胁。

8 Future Research Directions
8 未来研究方向

Standard measurements of faithfulness. A wide range of methods have been proposed to evaluate reasoning faithfulness, but none are comprehensive, often leading to divergent or even contradictory conclusions. For example, some studies argue that larger models exhibit greater faithfulness [85, 78], while others contend that they are less faithful [79]. This inconsistency highlights the need for more robust and standardized evaluation protocols that can fairly assess reasoning faithfulness across models.
忠实度的标准测量方法。目前已有多种方法被提出用于评估推理的忠实度，但它们都不是全面的，常常导致不同的甚至相互矛盾的结果。例如，一些研究表明较大模型表现出更高的忠实度[ 85, 78]，而另一些研究则认为它们忠实度较低[ 79]。这种不一致性突显了需要更稳健和标准化的评估协议，以便公平地评估不同模型间的推理忠实度。

In addition, some existing methods for evaluating faithfulness may conflict with other aspects of the performance of large models. For example, one common evaluation technique involves CoT intervention methods. These approaches test how perturbations to intermediate reasoning steps affect final answers. Empirical findings suggest that stronger models can answer correctly even with the perturbed CoT, implying that their outputs may rely less on explicit reasoning traces and more on internalized knowledge. From this, one might conclude that stronger models are less faithful, as their outputs do not depend transparently on the provided reasoning paths. However, such a conclusion conflicts with robustness. Therefore, eliminating the evaluation bias caused by model performance remains a critical open problem.
此外，一些现有的评估忠实性的方法可能与大型模型的其它性能方面产生冲突。例如，一种常见的评估技术涉及思维链干预方法。这些方法测试对中间推理步骤的扰动如何影响最终答案。实证研究表明，更强的模型即使在有扰动的思维链的情况下也能给出正确答案，这意味着它们的输出可能更依赖于内化的知识，而非明确的推理轨迹。由此，人们可能会得出更强的模型更不忠实的结论，因为它们的输出并不依赖于所提供的推理路径。然而，这样的结论与鲁棒性相冲突。因此，消除由模型性能引起的评估偏差仍然是一个关键的开问题。

More analyses on safety mechanism. After reviewing attack and defense methods in Section 4, we call for more studies on the safety mechanism. Previous works demonstrated the feasibility of post-training methods with an extra safety-related CoT dataset. However, heuristic insights into effective dataset construction remain limited, leaving many details, such as prompts for CoT distillation, data ratios across different sources, and the necessity of cold-start SFT, reliant on manual tuning and empirical intuition. Moreover, in terms of the safety tax, the empirical understanding of how reinforcement learning contributes to safety and alignment remains limited. For instance, it remains challenging to disentangle the extent to which performance gains stem from the learning algorithm itself (e.g., GRPO over DPO) versus the influence of higher-quality data, such as well-curated CoT examples. Some progress has been made in understanding the role of SFT versus RL [364, 365], and we encourage future work to further investigate the role and limits of RL in this context.
对安全机制进行更多分析。在回顾了第 4 节的攻击与防御方法后，我们呼吁对安全机制进行更多研究。以往工作证明了使用额外的安全相关 CoT 数据集进行后训练方法的可行性。然而，对于有效数据集构建的有效启发式见解仍然有限，导致许多细节，如 CoT 蒸馏的提示、不同来源的数据比例以及冷启动 SFT 的必要性，仍依赖手动调优和经验直觉。此外，在安全代价方面，对强化学习如何促进安全和一致性方面的实证理解仍然有限。例如，区分性能提升程度是源于学习算法本身（如 GRPO 相对于 DPO）还是高质量数据（如精心策划的 CoT 示例）的影响仍然具有挑战性。在理解 SFT 与 RL 的作用方面已取得一些进展[364, 365]，我们鼓励未来工作进一步研究在此背景下的 RL 的作用和局限性。

More fine-grained benchmarks. As language models continue to grow in capability, there is an increasing need for safety evaluation benchmarks that can effectively reflect their evolving behaviors. Current safety evaluation benchmarks are primarily based on a narrow set of related attack methods [244, 287, 288], resulting in significant homogenization of data distribution. As a consequence, metrics such as ASR often exhibit extreme values. Besides, due to the inherent properties of generative models, the outputs may be sensitive to variations in temperature settings and prompt formulations, thereby impacting the reproducibility of experimental results. In this regard, we call for new benchmarks that are more discriminative, detailed, and robust. In addition, compared with the number of benchmarks in safety and robustness, evaluations on privacy inference and fairness have comparatively received less emphasis. These areas would benefit from increased focus in future work if more evaluations with comprehensive coverage, clear definitions, and diverse testing samples are developed.
更细粒度的基准测试。随着语言模型能力的持续增长，对能够有效反映其不断变化行为的安全生产评估基准的需求日益增加。当前的安全生产评估基准主要基于一套狭窄的相关攻击方法[244, 287, 288]，导致数据分布出现显著同质化。因此，ASR 等指标往往表现出极端值。此外，由于生成模型的固有特性，其输出可能对温度设置和提示表述的变化敏感，从而影响实验结果的再现性。在这方面，我们呼吁开发更具区分度、更详细、更稳健的新基准。此外，与安全和鲁棒性基准的数量相比，隐私推理和公平性评估相对受到较少重视。如果未来开发更多具有全面覆盖、清晰定义和多样化测试样本的评估，这些领域将受益于更多的关注。

9 Conclusion 9 结论

In conclusion, this survey summarizes recent literature concerning trustworthiness in reasoning capabilities, providing a comprehensive overview with a clear taxonomy. With efforts on each topic, we describe the development of novel methods, point out prevailing conclusions, and highlight the related analysis as well as future opportunities. We believe that our comprehensive survey and structured taxonomy could offer a foundation for future research in building safer, more reliable models with reasoning capabilities.
总之，本综述总结了关于推理能力可信性的最新文献，提供了一个全面的概述和清晰的分类体系。在每个主题上，我们描述了新方法的发展，指出了主要结论，并强调了相关分析以及未来的机遇。我们相信，我们的全面综述和结构化分类体系可以为未来研究构建更安全、更可靠的推理模型提供基础。

References

[1] Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, and Radha Poovendran. Safechain: Safety of language models with long chain-of-thought reasoning capabilities. arXiv preprint arXiv:2502.12025, 2025.
[2] Chengda Lu, Xiaoyu Fan, Yu Huang, Rongwu Xu, Jijie Li, and Wei Xu. Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking? arXiv preprint arXiv:2505.17650, 2025.
[3] Zonghao Ying, Guangyi Zheng, Yongxin Huang, Deyue Zhang, Wenxin Zhang, Quanchen Zou, Aishan Liu, Xianglong Liu, and Dacheng Tao. Towards understanding the safety boundaries of deepseek models: Evaluation and findings. arXiv preprint arXiv:2503.15092, 2025.
[4] Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, et al. A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment. arXiv preprint arXiv:2504.15585, 2025.
[5] Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, and Yu Qiao. Attacks, defenses and evaluations for llm conversation safety: A survey. In Proc. NAACL, 2024.
[6] Dan Shi, Tianhao Shen, Yufei Huang, Zhigen Li, Yongqi Leng, Renren Jin, Chuang Liu, Xinwei Wu, Zishan Guo, Linhao Yu, et al. Large language model safety: A holistic survey. arXiv preprint arXiv:2412.17686, 2024.
[7] Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wangxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025.
[8] Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, et al. Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models. arXiv preprint arXiv:2501.09686, 2025.
[9] Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, et al. A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond. arXiv preprint arXiv:2503.21614, 2025.
[10] Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025.
[11] Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey. arXiv preprint arXiv:2504.10903, 2025.
[12] Haoyu Wang, Zeyu Qin, Li Shen, Xueqian Wang, Dacheng Tao, and Minhao Cheng. Safety Reasoning with Guidelines. In Proc. ICML, 2025.
[13] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[14] Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku, 2024.
[15] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[16] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Proc. NeurIPS, 2022.
[17] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Proc. NeurIPS, 2022.
[18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Proc. NeurIPS, 2020.
[19] Xi Li, Yusen Zhang, Renze Lou, Chen Wu, and Jiaqi Wang. Chain-of-scrutiny: Detecting backdoor attacks for large language models. arXiv preprint arXiv:2406.05948, 2024.
[20] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.
[21] Jun Wang, Meng Fang, Ziyu Wan, Muning Wen, Jiachen Zhu, Anjie Liu, Ziqin Gong, Yan Song, Lei Chen, Lionel M Ni, et al. Openr: An open source framework for advanced reasoning with large language models. arXiv preprint arXiv:2410.09671, 2024.
[22] Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, et al. O1 Replication Journey: A Strategic Progress Report–Part 1. arXiv preprint arXiv:2410.18982, 2024.
[23] Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 Replication Journey–Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? arXiv preprint arXiv:2411.16489, 2024.
[24] Zhongzhen Huang, Gui Geng, Shengyi Hua, Zhen Huang, Haoyang Zou, Shaoting Zhang, Pengfei Liu, and Xiaofan Zhang. O1 Replication Journey–Part 3: Inference-time Scaling for Medical Reasoning. arXiv preprint arXiv:2501.06458, 2025.
[25] Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, et al. Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning. arXiv preprint arXiv:2410.02884, 2024.
[26] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, pages 1–43, 2012.
[27] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168, 2021.
[28] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Proc. NeurIPS D&B Track, 2021.
[29] Minpeng Liao, Wei Luo, Chengxi Li, Jing Wu, and Kai Fan. MARIO: MAth Reasoning with code Interpreter Output–A Reproducible Pipeline. In Findings of Proc. ACL, page 905–924, 2024.
[30] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Proc. NeurIPS, 2023.
[31] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[32] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[33] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. arXiv preprint arXiv:2502.05171, 2025.
[34] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. In ICLR Workshop on LLM Reason and Plan, 2024.
[35] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[36] A Yang Qwen, Baosong Yang, B Zhang, B Hui, B Zheng, B Yu, Chengpeng Li, D Liu, F Huang, H Wei, et al. Qwen2.5 technical report. arXiv preprint, 2024.
[37] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024.
[38] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
[39] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Goodman. Star: Self-taught reasoner bootstrapping reasoning with reasoning. In Proc. NeurIPS, volume 1126, 2024.
[40] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In Proc. ICLR, 2023.
[41] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. T $\backslash$ " ulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.
[42] Youssef Mroueh. Reinforcement Learning with Verifiable Rewards: GRPO’s Effective Loss, Dynamics, and Success Amplification. arXiv preprint arXiv:2503.06639, 2025.
[43] Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, et al. Perception, reason, think, and plan: A survey on large multimodal reasoning models. arXiv preprint arXiv:2505.04921, 2025.
[44] Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey. arXiv preprint arXiv:2503.12605, 2025.
[45] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024.
[46] Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Proc. ICML, 2024.
[47] Haojie Zheng, Tianyang Xu, Hanchi Sun, Shu Pu, Ruoxi Chen, and Lichao Sun. Thinking before looking: Improving multimodal llm reasoning via mitigating visual hallucination. arXiv preprint arXiv:2411.12591, 2024.
[48] Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024.
[49] Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025.
[50] Haotian Xu, Xing Wu, Weinong Wang, Zhongzhi Li, Da Zheng, Boyuan Chen, Yi Hu, Shijia Kang, Jiaming Ji, Yingying Zhang, et al. RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? arXiv preprint arXiv:2501.11284, 2025.
[51] Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search. arXiv preprint arXiv:2412.18319, 2024.
[52] Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of-thought reasoning. arXiv preprint arXiv:2410.16198, 2024.
[53] Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Exploring long-chain visual reasoning with multimodal large language models. In Proc. CVPR, pages 9062–9072, 2025.
[54] Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. arXiv preprint arXiv:2412.05237, 2024.
[55] Linzhuang Sun, Hao Liang, Jingxuan Wei, Bihui Yu, Tianpeng Li, Fan Yang, Zenan Zhou, and Wentao Zhang. Mm-verify: Enhancing multimodal reasoning with chain-of-thought verification. arXiv preprint arXiv:2502.13383, 2025.
[56] Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, and Ji-Rong Wen. Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking. arXiv preprint arXiv:2501.01306, 2025.
[57] Shayan Ali Akbar, Md Mosharaf Hossain, Tess Wood, Si-Chi Chin, Erica M Salinas, Victor Alvarez, and Erwin Cornejo. HalluMeasure: Fine-grained hallucination measurement using chain-of-thought reasoning. In Proc. EMNLP, pages 15020–15037, 2024.
[58] Ron Eliav, Arie Cattan, Eran Hirsch, Shahaf Bassan, Elias Stengel-Eskin, Mohit Bansal, and Ido Dagan. CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection. arXiv preprint arXiv:2506.05243, 2025.
[59] Zikai Xie. Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models. arXiv preprint arXiv:2408.05093, 2024.
[60] Qiong Wu, Xiangcong Yang, Yiyi Zhou, Chenxin Fang, Baiyang Song, Xiaoshuai Sun, and Rongrong Ji. Grounded chain-of-thought for multimodal large language models. arXiv preprint arXiv:2503.12799, 2025.
[61] Yue Jiang, Jiawei Chen, Dingkang Yang, Mingcheng Li, Shunli Wang, Tong Wu, Ke Li, and Lihua Zhang. CoMT: Chain-of-Medical-Thought Reduces Hallucination in Medical Report Generation. In Proc. ICASSP, 2025.
[62] Bowen Dong, Minheng Ni, Zitong Huang, Guanglei Yang, Wangmeng Zuo, and Lei Zhang. MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM. arXiv preprint arXiv:2505.24238, 2025.
[63] Linxin Song, Taiwei Shi, and Jieyu Zhao. The Hallucination Tax of Reinforcement Finetuning. arXiv preprint arXiv:2505.13988, 2025.
[64] Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models. arXiv preprint arXiv:2505.21523, 2025.
[65] Zijun Yao, Yantao Liu, Yanxu Chen, Jianhui Chen, Junfeng Fang, Lei Hou, Juanzi Li, and Tat-Seng Chua. Are Reasoning Models More Prone to Hallucination? arXiv preprint arXiv:2505.23646, 2025.
[66] Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri, and Samuel J Bell. AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions. arXiv preprint arXiv:2506.09038, 2025.
[67] Haolang Lu, Yilian Liu, Jingxin Xu, Guoshun Nan, Yuanlong Yu, Zhican Chen, and Kun Wang. Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models. arXiv preprint arXiv:2505.13143, 2025.
[68] Junyi Li and Hwee Tou Ng. The Hallucination Dilemma: Factuality-Aware Reinforcement Learning for Large Reasoning Models. arXiv preprint arXiv:2505.24630, 2025.
[69] Dang Hoang Anh, Vu Tran, and Le Minh Nguyen. Analyzing Logical Fallacies in Large Language Models: A Study on Hallucination in Mathematical Reasoning. In JSAI International Symposium on Artificial Intelligence, pages 179–195. Springer, 2025.
[70] Zhongxiang Sun, Qipeng Wang, Haoyu Wang, Xiao Zhang, and Jun Xu. Detection and Mitigation of Hallucination in Large Reasoning Models: A Mechanistic Perspective. arXiv preprint arXiv:2505.12886, 2025.
[71] Dadi Guo, Jiayu Liu, Zhiyuan Fan, Zhitao He, Haoran Li, Yumeng Wang, et al. Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models. arXiv preprint arXiv:2506.17114, 2025.
[72] Ruosen Li, Ziming Luo, and Xinya Du. Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning. arXiv preprint arXiv:2410.06304, 2024.
[73] Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Reasoning Models Know When They’re Right: Probing Hidden States for Self-Verification. arXiv preprint arXiv:2504.05419, 2025.
[74] Changyue Wang, Weihang Su, Qingyao Ai, and Yiqun Liu. Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models. arXiv preprint arXiv:2506.04832, 2025.
[75] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
[76] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. In Proc. NeurIPS, 2023.
[77] Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasović, and Yonatan Belinkov. Measuring faithfulness of chains of thought by unlearning reasoning steps. arXiv preprint arXiv:2502.14829, 2025.
[78] Zidi Xiong, Chen Shan, Zhenting Qi, and Himabindu Lakkaraju. Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models. arXiv preprint arXiv:2505.13774, 2025.
[79] Oliver Bentham, Nathan Stringham, and Ana Marasovic. Chain-of-Thought Unfaithfulness as Disguised Accuracy. Transactions on Machine Learning Research, 2024.
[80] Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025.
[81] James Chua and Owain Evans. Are DeepSeek R1 And Other Reasoning Models More Faithful? In ICLR 2025 Workshop on Foundation Models in the Wild, 2025.
[82] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner Fabien Roger Vlad Mikulik, Sam Bowman, Jan Leike Jared Kaplan, et al. Reasoning Models Don’t Always Say What They Think. Anthropic Research, 2025.
[83] Jiachun Li, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. Towards faithful chain-of-thought: Large language models are bridging reasoners. arXiv preprint arXiv:2405.18915, 2024.
[84] Chirag Agarwal, Sree Harsha Tanneru, and Himabindu Lakkaraju. Faithfulness vs. plausibility: On the (un) reliability of explanations from large language models. arXiv preprint arXiv:2402.04614, 2024.
[85] Guangsheng Bao, Hongbo Zhang, Cunxiang Wang, Linyi Yang, and Yue Zhang. How Likely Do LLMs with CoT Mimic Human Reasoning? In Proc. COLING, 2024.
[86] Sree Harsha Tanneru, Dan Ley, Chirag Agarwal, and Himabindu Lakkaraju. On the difficulty of faithful chain-of-thought reasoning in large language models. In ICML Workshop on TiFA, 2024.
[87] Elita Lobo, Chirag Agarwal, and Himabindu Lakkaraju. On the impact of fine-tuning on chain-of-thought reasoning. In Proc. NAACL, 2025.
[88] Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning. In Findings of Proc. EMNLP, pages 15012–15032, 2024.
[89] Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu. Faithful logical reasoning via symbolic chain-of-thought. In Proc. ACL, 2024.
[90] Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson Denison, Danny Hernandez, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, et al. Question decomposition improves the faithfulness of model-generated reasoning. arXiv preprint arXiv:2307.11768, 2023.
[91] Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning. In Proc. IJCNLP-AACL, 2023.
[92] Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning. In Proc. EMNLP, 2023.
[93] Erik Arakelyan, Pasquale Minervini, Pat Verga, Patrick Lewis, and Isabelle Augenstein. FLARE: Faithful Logic-Aided Reasoning and Exploration. arXiv preprint arXiv:2410.11900, 2024.
[94] Joshua Ong Jun Leang, Aryo Pradipta Gema, and Shay B Cohen. CoMAT: Chain of mathematically annotated thought improves mathematical reasoning. arXiv preprint arXiv:2410.10336, 2024.
[95] Jiawei Wang, Da Cao, Shaofei Lu, Zhanchang Ma, Junbin Xiao, and Tat-Seng Chua. Causal-driven Large Language Models with Faithful Reasoning for Knowledge Question Answering. In Proc. MM, pages 4331–4340, 2024.
[96] Minghe Gao, Shuang Chen, Liang Pang, Yuan Yao, Jisheng Dang, Wenqiao Zhang, Juncheng Li, Siliang Tang, Yueting Zhuang, and Tat-Seng Chua. Fact: Teaching mllms with faithful, concise and transferable rationales. In Proc. MM, pages 846–855, 2024.
[97] Scott Viteri, Max Lamparth, Peter Chatain, and Clark Barrett. Markovian Transformers for Informative Language Modeling. arXiv preprint arXiv:2404.18988, 2024.
[98] Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Limin Han, Jiaojiao Zhao, Beibei Huang, Zhenhong Long, Junting Guo, Meijuan An, Rongjia Du, et al. Safety Evaluation and Enhancement of DeepSeek Models in Chinese Contexts. arXiv preprint arXiv:2503.16529, 2025.
[99] Miguel Romero-Arjona, Pablo Valle, Juan C Alonso, Ana B Sánchez, Miriam Ugarte, Antonia Cazalilla, Vicente Cambrón, José A Parejo, Aitor Arrieta, and Sergio Segura. Red Teaming Contemporary AI Models: Insights from Spanish and Basque Perspectives. arXiv preprint arXiv:2503.10192, 2025.
[100] Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Shreedhar Jangam, Jayanth Srinivasa, Gaowen Liu, Dawn Song, and Xin Eric Wang. The hidden risks of large reasoning models: A safety assessment of r1. arXiv preprint arXiv:2502.12659, 2025.
[101] Ang Li, Yichuan Mo, Mingjie Li, Yifei Wang, and Yisen Wang. Are Smarter LLMs Safer? Exploring Safety-Reasoning Trade-offs in Prompting and Fine-Tuning. arXiv preprint arXiv:2502.09673, 2025.
[102] Xinyue Lou, You Li, Jinan Xu, Xiangyu Shi, Chi Chen, and Kaiyu Huang. Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model. arXiv preprint arXiv:2505.06538, 2025.
[103] Paul Kassianik and Amin Karbasi. Evaluating Security Risk in DeepSeek and Other Frontier Reasoning Models. Cisco, https://blogs. cisco. com/security/evaluating-security-risk-in-deepseek-and-other-frontier-reasoningmodels, 2025.
[104] Arjun Krishna, Aaditya Rastogi, and Erick Galinkin. Weakest Link in the Chain: Security Vulnerabilities in Advanced Reasoning Models. arXiv preprint arXiv:2506.13726, 2025.
[105] Christina Q Knight, Kaustubh Deshpande, Ved Sirdeshmukh, Meher Mankikar, Scale Red Team, SEAL Team, and Julian Michael. FORTRESS: Frontier Risk Evaluation for National Security and Public Safety. arXiv preprint arXiv:2506.14922, 2025.
[106] Yihe Fan, Wenqi Zhang, Xudong Pan, and Min Yang. Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems. arXiv preprint arXiv:2505.17815, 2025.
[107] Baihui Zheng, Boren Zheng, Kerui Cao, Yingshui Tan, Zhendong Liu, Weixun Wang, Jiaheng Liu, Jian Yang, Wenbo Su, Xiaoyong Zhu, et al. Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models. arXiv preprint arXiv:2505.19690, 2025.
[108] Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, and Jing Shao. IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks. arXiv preprint arXiv:2506.16402, 2025.
[109] Junfeng Fang, Yukai Wang, Ruipeng Wang, Zijun Yao, Kun Wang, An Zhang, Xiang Wang, and Tat-Seng Chua. SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models. arXiv preprint arXiv:2504.08813, 2025.
[110] Weixiang Zhao, Xingyu Sui, Jiahe Guo, Yulin Hu, Yang Deng, Yanyan Zhao, Bing Qin, Wanxiang Che, Tat-Seng Chua, and Ting Liu. Trade-offs in large reasoning models: An empirical analysis of deliberative and adaptive reasoning over foundational capabilities. arXiv preprint arXiv:2503.17979, 2025.
[111] Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, et al. DeepSeek-R1 Thoughtology: Let’s think about LLM Reasoning. arXiv preprint arXiv:2504.07128, 2025.
[112] Mahdi Sabbaghi, Paul Kassianik, George Pappas, Yaron Singer, Amin Karbasi, and Hamed Hassani. Adversarial Reasoning at Jailbreaking Time. In Proc. ICML, 2025.
[113] Jingbo Su. Enhancing Adversarial Attacks through Chain of Thought. arXiv preprint arXiv:2410.21791, 2024.
[114] Zonghao Ying, Deyue Zhang, Zonglei Jing, Yisong Xiao, Quanchen Zou, Aishan Liu, Siyuan Liang, Xiangzheng Zhang, Xianglong Liu, and Dacheng Tao. Reasoning-augmented conversation for multi-turn jailbreak attacks on large language models. arXiv preprint arXiv:2502.11054, 2025.
[115] Wenhan Chang, Tianqing Zhu, Yu Zhao, Shuangyong Song, Ping Xiong, Wanlei Zhou, and Yongxiang Li. Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models. arXiv preprint arXiv:2505.17519, 2025.
[116] Divij Handa, Zehua Zhang, Amir Saeidi, Shrinidhi Kumbhar, and Chitta Baral. When “competency" in reasoning opens the door to vulnerability: Jailbreaking llms via novel complex ciphers. arXiv preprint arXiv:2402.10601, 2024.
[117] Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Hai Li, and Yiran Chen. H-cot: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking. arXiv preprint arXiv:2502.12893, 2025.
[118] Yang Yao, Xuan Tong, Ruofan Wang, Yixu Wang, Lujundong Li, Liang Liu, Yan Teng, and Yingchun Wang. A mousetrap: Fooling large reasoning models for jailbreak with chain of iterative chaos. arXiv preprint arXiv:2502.15806, 2025.
[119] Jiacheng Liang, Tanqiu Jiang, Yuhui Wang, Rongyi Zhu, Fenglong Ma, and Ting Wang. AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models. arXiv preprint arXiv:2505.10846, 2025.
[120] Viet-Anh Nguyen, Shiqian Zhao, Gia Dao, Runyi Hu, Yi Xie, and Luu Anh Tuan. Three minds, one legend: Jailbreak large reasoning model with adaptive stacked ciphers. arXiv preprint arXiv:2505.16241, 2025.
[121] Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Shaohui Mei, and Lap-Pui Chau. Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models. arXiv preprint arXiv:2504.05050, 2025.
[122] Yifei Liu, Yu Cui, and Haibin Zhang. RRTL: Red Teaming Reasoning Large Language Models in Tool Learning. arXiv preprint arXiv:2505.17106, 2025.
[123] Bingrui Sima, Linhua Cong, Wenxuan Wang, and Kun He. VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models. arXiv preprint arXiv:2505.19684, 2025.
[124] Jingyuan Ma, Rui Li, Zheng Li, Junfeng Liu, Lei Sha, and Zhifang Sui. HauntAttack: When Attack Follows Reasoning as a Shadow. arXiv preprint arXiv:2506.07031, 2025.
[125] Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, and Bryan Hooi. GuardReasoner: Towards Reasoning-based LLM Safeguards. arXiv preprint arXiv:2501.18492, 2025.
[126] Bibek Upadhayay, Vahid Behzadan, et al. X-Guard: Multilingual guard agent for content moderation. arXiv preprint arXiv:2504.08848, 2025.
[127] Yahan Yang, Soham Dan, Shuo Li, Dan Roth, and Insup Lee. MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning. arXiv preprint arXiv:2504.15241, 2025.
[128] Jingnan Zheng, Xiangtian Ji, Yijun Lu, Chenhang Cui, Weixiang Zhao, Gelei Deng, Zhenkai Liang, An Zhang, and Tat-Seng Chua. RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards. arXiv preprint arXiv:2506.07736, 2025.
[129] Makesh Narsimhan Sreedhar, Traian Rebedea, and Christopher Parisien. Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models. arXiv preprint arXiv:2505.20087, 2025.
[130] Mintong Kang and Bo Li. $R^{2}$ -Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning. In Proc. ICLR, 2025.
[131] Ruoxi Cheng, Haoxuan Ma, Weixin Wang, Zhiqiang Wang, Xiaoshuang Jia, Simeng Qin, Xiaochun Cao, Yang Liu, and Xiaojun Jia. Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment. arXiv preprint arXiv:2503.18991, 2025.
[132] Shiyao Cui, Qinglin Zhang, Xuan Ouyang, Renmiao Chen, Zhexin Zhang, Yida Lu, Hongning Wang, Han Qiu, and Minlie Huang. ShieldVLM: Safeguarding the Multimodal Implicit Toxicity via Deliberative Reasoning with LVLMs. arXiv preprint arXiv:2505.14035, 2025.
[133] Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, et al. Guardreasoner-vl: Safeguarding vlms via reinforced reasoning. arXiv preprint arXiv:2505.11049, 2025.
[134] Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, et al. GuardAgent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning. arXiv preprint arXiv:2406.09187, 2024.
[135] Zhaorun Chen, Mintong Kang, and Bo Li. ShieldAgent: Shielding agents via verifiable safety policy reasoning. arXiv preprint arXiv:2503.22738, 2025.
[136] Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. arXiv preprint arXiv:2505.03318, 2025.
[137] Fengjun Pan, Anh Tuan Luu, and Xiaobao Wu. Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning. arXiv preprint arXiv:2506.08477, 2025.
[138] Tong Wu, Chong Xiang, Jiachen T Wang, and Prateek Mittal. Effectively Controlling Reasoning Models through Thinking Intervention. arXiv preprint arXiv:2503.24370, 2025.
[139] Kureha Yamaguchi, Benjamin Etheridge, and Andy Arditi. Adversarial Manipulation of Reasoning Models using Internal Representations. In ICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025.
[140] Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, et al. Trading inference-time compute for adversarial robustness. arXiv preprint arXiv:2501.18841, 2025.
[141] Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, and Hanghang Tong. Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance. arXiv preprint arXiv:2506.06444, 2025.
[142] Zhili Liu, Yunhao Gou, Kai Chen, Lanqing Hong, Jiahui Gao, Fei Mi, Yu Zhang, Zhenguo Li, Xin Jiang, Qun Liu, et al. Mixture of insightful experts (mote): The synergy of thought chains and expert mixtures in self-alignment. arXiv preprint arXiv:2405.00557, 2024.
[143] Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M Bikel, Jason Weston, and Eric Michael Smith. Backtracking improves generation safety. In Proc. ICLR, 2025.
[144] Xianglin Yang, Gelei Deng, Jieming Shi, Tianwei Zhang, and Jin Song Dong. Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning. arXiv preprint arXiv:2501.19180, 2025.
[145] Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, and Lei Sha. Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking. arXiv preprint arXiv:2502.12970, 2025.
[146] Yuyou Zhang, Miao Li, William Han, Yihang Yao, Zhepeng Cen, and Ding Zhao. Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety. arXiv preprint arXiv:2503.05021, 2025.
[147] Kehua Feng, Keyan Ding, Jing Yu, Menghan Li, Yuhao Wang, Tong Xu, Xinda Wang, Qiang Zhang, and Huajun Chen. ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization. arXiv preprint arXiv:2504.02725, 2025.
[148] Yutao Mou, Yuxiao Luo, Shikun Zhang, and Wei Ye. SaRO: Enhancing LLM Safety through Reasoning-based Alignment. arXiv preprint arXiv:2504.09420, 2025.
[149] Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan, and Aviral Kumar. Reasoning as an Adaptive Defense for Safety. arXiv preprint arXiv:2507.00971, 2025.
[150] Changyue Jiang, Xudong Pan, and Min Yang. Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction. arXiv preprint arXiv:2505.11063, 2025.
[151] Changyi Li, Jiayi Wang, Xudong Pan, Geng Hong, and Min Yang. ReasoningShield: Content Safety Detection over Reasoning Traces of Large Reasoning Models. arXiv preprint arXiv:2505.17244, 2025.
[152] Melody Y Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, et al. Deliberative alignment: Reasoning enables safer language models. arXiv preprint arXiv:2412.16339, 2024.
[153] Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Jieru Mei, Brian R Bartoldson, Bhavya Kailkhura, and Cihang Xie. STAR-1: Safer Alignment of Reasoning LLMs with 1K Data. arXiv preprint arXiv:2504.01903, 2025.
[154] Yichi Zhang, Zihao Zeng, Dongbai Li, Yao Huang, Zhijie Deng, and Yinpeng Dong. RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability. arXiv preprint arXiv:2504.10081, 2025.
[155] Wonje Jeung, Sangyeon Yoon, Minsuk Kahng, and Albert No. SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment. arXiv preprint arXiv:2505.14667, 2025.
[156] Wenbin Hu, Haoran Li, Huihao Jing, Qi Hu, Ziqian Zeng, Sirui Han, Heli Xu, Tianshu Chu, Peizhao Hu, and Yangqiu Song. Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning. arXiv preprint arXiv:2505.14585, 2025.
[157] Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926, 2025.
[158] Zhexin Zhang, Xian Qi Loye, Victor Shea-Jay Huang, Junxiao Yang, Qi Zhu, Shiyao Cui, Fei Mi, Lifeng Shang, Yingkang Wang, Hongning Wang, et al. How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study. arXiv preprint arXiv:2505.15404, 2025.
[159] Ruoxi Cheng, Haoxuan Ma, and Weixin Wang. Hair: Hardness-aware inverse reinforcement learning with introspective reasoning for llm alignment. arXiv preprint arXiv:2503.18991, 2025.
[160] Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, and Natasha Jaques. Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models. arXiv preprint arXiv:2506.07468, 2025.
[161] Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Yichang Xu, and Ling Liu. Safety tax: Safety alignment makes your large reasoning models less reasonable. arXiv preprint arXiv:2503.00555, 2025.
[162] Kaiwen Zhou, Xuandong Zhao, Gaowen Liu, Jayanth Srinivasa, Aosong Feng, Dawn Song, and Xin Eric Wang. SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning. arXiv preprint arXiv:2505.16186, 2025.
[163] Naizhu Jin, Zhong Li, Yinggang Guo, Chao Su, Tian Zhang, and Qingkai Zeng. SABER: Model-agnostic Backdoor Attack on Chain-of-Thought in Neural Code Generation. arXiv preprint arXiv:2412.05829, 2024.
[164] Zihao Zhu, Hongbao Zhang, Ruotong Wang, Ke Xu, Siwei Lyu, and Baoyuan Wu. To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models. arXiv preprint arXiv:2502.12202, 2025.
[165] Gejian Zhao, Hanzhou Wu, Xinpeng Zhang, and Athanasios V Vasilakos. Shadowcot: Cognitive hijacking for stealthy reasoning backdoors in llms. arXiv preprint arXiv:2504.05605, 2025.
[166] James Chua, Jan Betley, Mia Taylor, and Owain Evans. Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models. arXiv preprint arXiv:2506.13206, 2025.
[167] Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, and Bo Li. Badchain: Backdoor chain-of-thought prompting for large language models. In Proc. ICLR, 2024.
[168] Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, and Jun Sun. Backdoorllm: A comprehensive benchmark for backdoor attacks on large language models. arXiv preprint arXiv:2408.12798, 2024.
[169] Zhen Guo and Reza Tourani. Darkmind: Latent chain-of-thought backdoor in customized llms. arXiv preprint arXiv:2501.18617, 2025.
[170] Yu Cui, Bryan Hooi, Yujun Cai, and Yiwei Wang. Process or result? manipulated ending tokens can mislead reasoning llms to ignore the correct reasoning steps. arXiv preprint arXiv:2503.19326, 2025.
[171] Jiawei Guo and Haipeng Cai. System prompt poisoning: Persistent attacks on large language models beyond user injection. arXiv preprint arXiv:2505.06493, 2025.
[172] Yu Cui and Cong Zuo. Practical Reasoning Interruption Attacks on Reasoning Large Language Models. arXiv preprint arXiv:2505.06643, 2025.
[173] Yu Cui, Yujun Cai, and Yiwei Wang. Token-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression. arXiv preprint arXiv:2504.20493, 2025.
[174] Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, and Yixing Fan. Chain-of-Thought Poisoning Attacks against R1-based Retrieval-Augmented Generation Systems. arXiv preprint arXiv:2505.16367, 2025.
[175] Ryan Marinelli, Josef Pichlmeier, and Tamas Bisztray. Harnessing Chain-of-Thought Metadata for Task Routing and Adversarial Prompt Detection. arXiv preprint arXiv:2503.21464, 2025.
[176] Naizhu Jin, Zhong Li, Tian Zhang, and Qingkai Zeng. GUARD: Dual-Agent based Backdoor Defense on Chain-of-Thought in Neural Code Generation. arXiv preprint arXiv:2505.21425, 2025.
[177] Qian Wang, Zhanzhi Lou, Zhenheng Tang, Nuo Chen, Xuandong Zhao, Wenxuan Zhang, Dawn Song, and Bingsheng He. Assessing Judging Bias in Large Reasoning Models: An Empirical Study. arXiv preprint arXiv:2504.09946, 2025.
[178] Wenxiao Wang, Parsa Hosseini, and Soheil Feizi. Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption. arXiv preprint arXiv:2504.20769, 2025.
[179] Kai Yan, Yufei Xu, Zhengyin Du, Xuesong Yao, Zheyu Wang, Xiaowen Guo, and Jiecao Chen. Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems? arXiv preprint arXiv:2504.00509, 2025.
[180] Haoyan Yang, Runxue Bao, Cao Xiao, Jun Ma, Parminder Bhatia, Shangqian Gao, and Taha Kass-Hout. Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector. arXiv preprint arXiv:2505.17100, 2025.
[181] Yuqing Wang and Yun Zhao. Rupbench: Benchmarking reasoning under perturbations for robustness evaluation in large language models. arXiv preprint arXiv:2406.11020, 2024.
[182] Norman Mu, Jonathan Lu, Michael Lavery, and David Wagner. A Closer Look at System Prompt Robustness. arXiv preprint arXiv:2502.12197, 2025.
[183] Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, and Zhiqiang Shen. A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1. arXiv preprint arXiv:2503.10635, 2025.
[184] Bin Zhu, Hailong Yin, Jingjing Chen, and Yu-Gang Jiang. Reasoning Models Are More Easily Gaslighted Than You Think. arXiv preprint arXiv:2506.09677, 2025.
[185] Zhanke Zhou, Rong Tao, Jianing Zhu, Yiwen Luo, Zengmao Wang, and Bo Han. Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales? In Proc. NeurIPS, 2024.
[186] Jingyu Peng, Maolin Wang, Xiangyu Zhao, Kai Zhang, Wanyu Wang, Pengyue Jia, Qidong Liu, Ruocheng Guo, and Qi Liu. Stepwise Reasoning Disruption Attack of LLMs. In Proc. ACL, pages 5040–5058, 2025.
[187] Yiming Wang, Pei Zhang, Jialong Tang, Haoran Wei, Baosong Yang, Rui Wang, Chenshu Sun, Feitong Sun, Jiran Zhang, Junxuan Wu, et al. Polymath: Evaluating mathematical reasoning in multilingual contexts. arXiv preprint arXiv:2504.18428, 2025.
[188] Meghana Rajeev, Rajkumar Ramamurthy, Prapti Trivedi, Vikas Yadav, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudan, James Zou, and Nazneen Rajani. Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models. arXiv preprint arXiv:2503.01781, 2025.
[189] Tong Yu, Yongcheng Jing, Xikun Zhang, Wentao Jiang, Wenjie Wu, Yingjie Wang, Wenbin Hu, Bo Du, and Dacheng Tao. Benchmarking reasoning robustness in large language models. arXiv preprint arXiv:2503.04550, 2025.
[190] Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, et al. MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations. arXiv preprint arXiv:2502.06453, 2025.
[191] Man Ho Lam, Chaozheng Wang, Jen-tse Huang, and Michael R Lyu. CODECRASH: Stress Testing LLM Reasoning under Structural and Semantic Perturbations. arXiv preprint arXiv:2504.14119, 2025.
[192] Jaechul Roh, Varun Gandhi, Shivani Anilkumar, and Arin Garg. Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation. arXiv preprint arXiv:2506.06971, 2025.
[193] Rongwu Xu, Zehan Qi, and Wei Xu. Preemptive answer “attacks" on chain-of-thought reasoning. In Findings of Proc. ACL, 2024.
[194] Jingyuan Ma, Damai Dai, Lei Sha, and Zhifang Sui. Large language models are unconscious of unreasonability in math problems. arXiv preprint arXiv:2403.19346, 2024.
[195] Masoud Hashemi, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudhan, Jishnu Sethumadhavan Nair, Aman Tiwari, and Vikas Yadav. Dnr bench: Benchmarking over-reasoning in reasoning llms. arXiv preprint arXiv:2503.15793, 2025.
[196] Yancheng He, Shilong Li, Jiaheng Liu, Weixun Wang, Xingyuan Bu, Ge Zhang, Zhongyuan Peng, Zhaoxiang Zhang, Zhicheng Zheng, Wenbo Su, et al. Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? In Proc. ACL, 2025.
[197] Chenrui Fan, Ming Li, Lichao Sun, and Tianyi Zhou. Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill? arXiv preprint arXiv:2504.06514, 2025.
[198] Wai Man Si, Mingjie Li, Michael Backes, and Yang Zhang. Excessive Reasoning Attack on Reasoning LLMs. arXiv preprint arXiv:2506.14374, 2025.
[199] Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, et al. Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs. arXiv preprint arXiv:2501.18585, 2025.
[200] Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms. arXiv preprint arXiv:2505.00127, 2025.
[201] Renfei Dang, Shujian Huang, and Jiajun Chen. Internal Bias in Reasoning Models leads to Overthinking. arXiv preprint arXiv:2505.16448, 2025.
[202] Abhinav Kumar, Jaechul Roh, Ali Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, and Eugene Bagdasarian. Overthink: Slowdown attacks on reasoning llms. arXiv preprint arXiv:2502.02542, 2025.
[203] Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, et al. The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks. arXiv preprint arXiv:2502.08235, 2025.
[204] Xuying Li, Zhuo Li, Yuji Kosuga, and Victor Bian. Output Length Effect on DeepSeek-R1’s Safety in Forced Thinking. arXiv preprint arXiv:2503.01923, 2025.
[205] Chung-En Sun, Ge Yan, and Tsui-Wei Weng. ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models. arXiv preprint arXiv:2503.22048, 2025.
[206] Fangru Lin, Shaoguang Mao, Emanuele La Malfa, Valentin Hofmann, Adrian de Wynter, Xun Wang, Si-Qing Chen, Michael J Wooldridge, Janet B Pierrehumbert, and Furu Wei. Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks. In Proc. ACL, 2025.
[207] Xiaoqing Cheng, Hongying Zan, Lulu Kong, Jinwang Song, and Min Peng. Detection, Classification, and Mitigation of Gender Bias in Large Language Models. arXiv preprint arXiv:2506.12527, 2025.
[208] Mahammed Kamruzzaman and Gene Louis Kim. Prompting techniques for reducing social bias in llms through system 1 and system 2 cognitive processes. arXiv preprint arXiv:2404.17218, 2024.
[209] Saloni Dash, Amélie Reymond, Emma S Spiro, and Aylin Caliskan. Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning. arXiv preprint arXiv:2506.20020, 2025.
[210] Shashank Gupta, Vaishnavi Shrivastava, Ameet Deshpande, Ashwin Kalyan, Peter Clark, Ashish Sabharwal, and Tushar Khot. Bias runs deep: Implicit reasoning biases in persona-assigned llms. In Proc. ICLR, 2024.
[211] Zhiting Fan, Ruizhe Chen, and Zuozhu Liu. Biasguard: A reasoning-enhanced bias detection tool for large language models. In Findings of Proc. ACL, 2025.
[212] Riccardo Cantini, Nicola Gabriele, Alessio Orsino, and Domenico Talia. Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models. arXiv preprint arXiv:2507.02799, 2025.
[213] Sangyeon Yoon, Wonje Jeung, and Albert No. R-tofu: Unlearning in large reasoning models. arXiv preprint arXiv:2505.15214, 2025.
[214] Changsheng Wang, Chongyu Fan, Yihua Zhang, Jinghan Jia, Dennis Wei, Parikshit Ram, Nathalie Baracaldo, and Sijia Liu. Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills. arXiv preprint arXiv:2506.12963, 2025.
[215] Yash Sinha, Manit Baser, Murari Mandal, Dinil Mon Divakaran, and Mohan Kankanhalli. Step-by-Step Reasoning Attack: Revealing ’Erased’ Knowledge in Large Language Models. arXiv preprint arXiv:2506.17279, 2025.
[216] Peng Wanli, Xue Yiming, et al. ImF: Implicit Fingerprint for Large Language Models. arXiv preprint arXiv:2503.21805, 2025.
[217] Zhenzhen Ren, GuoBiao Li, Sheng Li, Zhenxing Qian, and Xinpeng Zhang. CoTSRF: Utilize Chain of Thought as Stealthy and Robust Fingerprint of Large Language Models. arXiv preprint arXiv:2505.16785, 2025.
[218] Junfeng Guo, Yiming Li, Ruibo Chen, Yihan Wu, Chenxi Liu, Yanshuo Chen, and Heng Huang. Towards copyright protection for knowledge bases of retrieval-augmented language models via ownership verification with reasoning. arXiv preprint arXiv:2502.10440, 2025.
[219] Yash Savani, Asher Trockman, Zhili Feng, Avi Schwarzschild, Alexander Robey, Marc Finzi, and J Zico Kolter. Antidistillation sampling. arXiv preprint arXiv:2504.13146, 2025.
[220] Tommaso Green, Martin Gubri, Haritz Puerto, Sangdoo Yun, and Seong Joon Oh. Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers. arXiv preprint arXiv:2506.15674, 2025.
[221] Weidi Luo, Tianyu Lu, Qiming Zhang, Xiaogeng Liu, Bin Hu, Yue Zhao, Jieyu Zhao, Song Gao, Patrick McDaniel, Zhen Xiang, et al. Doxing via the Lens: Revealing Location-related Privacy Leakage on Multi-modal Large Reasoning Models. arXiv preprint arXiv:2504.19373, 2025.
[222] Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, et al. TrustLLM: Trustworthiness in Large Language Models. In Proc. ICML, 2024.
[223] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025.
[224] Vipula Rawte, Amit Sheth, and Amitava Das. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922, 2023.
[225] Jiahao Cheng, Tiancheng Su, Jia Yuan, Guoxiu He, Jiawei Liu, Xinqi Tao, Jingwen Xie, and Huaxia Li. Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation. arXiv preprint arXiv:2506.17088, 2025.
[226] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
[227] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In Proc. ICLR, 2023.
[228] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proc. ACL, 2022.
[229] Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models. In Proc. EMNLP, 2023.
[230] Qinyuan Cheng, Tianxiang Sun, Wenwei Zhang, Siyin Wang, Xiangyang Liu, Mozhi Zhang, Junliang He, Mianqiu Huang, Zhangyue Yin, Kai Chen, et al. Evaluating hallucinations in chinese large language models. arXiv preprint arXiv:2310.03368, 2023.
[231] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024.
[232] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
[233] Wei Li, Wenhao Wu, Moye Chen, Jiachen Liu, Xinyan Xiao, and Hua Wu. Faithfulness in natural language generation: A systematic survey of analysis, evaluation and optimization methods. arXiv preprint arXiv:2203.05227, 2022.
[234] Alon Jacovi and Yoav Goldberg. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In Proc. ACL, pages 4198–4205, 2020.
[235] Evelyn Yee, Alice Li, Chenyu Tang, Yeon Ho Jung, Ramamohan Paturi, and Leon Bergen. Dissociation of faithful and unfaithful reasoning in llms. arXiv preprint arXiv:2405.15092, 2024.
[236] Peter Hase, Shiyue Zhang, Harry Xie, and Mohit Bansal. Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? In Findings of Proc. EMNLP, pages 4351–4367, 2020.
[237] Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. In Proc. COLM, 2024.
[238] Judea Pearl. Causality. Cambridge university press, 2009.
[239] Cheng-Yu Hsieh, Chun-Liang Li, Chih-kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. In Findings of Proc. ACL, pages 8003–8017, 2023.
[240] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. In Proc. ICML, 2024.
[241] Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks. In Proc. NeurIPS D&B Track, 2024.
[242] Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, et al. Air-bench 2024: A safety benchmark based on risk categories from regulations and policies. arXiv preprint arXiv:2407.17436, 2024.
[243] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. In Proc. NeurIPS D&B Track, 2024.
[244] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
[245] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In Proc. SaTML, 2025.
[246] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. In Proc. NeurIPS, 2024.
[247] Google DeepMind. Gemini 2.0 Flash Thinking, 2025.
[248] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025.
[249] NovaSky Team. Sky-T1: Train your own O1 preview model within $450. https://novasky-ai.github.io/posts/sky-t1, 2025. Accessed: 2025-01-09.
[250] Qwen Team. QwQ-32B: Embracing the Power of Reinforcement Learning, March 2025.
[251] Skywork o1 Team. Skywork-o1 Open Series. https://huggingface.co/Skywork, November 2024.
[252] Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, et al. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. In Proc. NeurIPS, 2024.
[253] Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, et al. Cyberseceval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models. arXiv preprint arXiv:2408.01605, 2024.
[254] Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Meijuan An, Bikun Yang, KaiKai Zhao, Kai Wang, and Shiguo Lian. Chisafetybench: A chinese hierarchical safety benchmark for large language models. arXiv preprint arXiv:2406.10311, 2024.
[255] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? In Proc. NeurIPS, 2023.
[256] Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025.
[257] Yi Peng, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, et al. Skywork r1v: Pioneering multimodal reasoning with chain-of-thought. arXiv preprint arXiv:2504.05599, 2025.
[258] Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536, 2025.
[259] Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T Kwok, and Yu Zhang. Eyes closed, safety on: Protecting multimodal llms via image-to-text transformation. In Proc. ECCV, pages 388–404, 2024.
[260] Yanbo Wang, Jiyang Guan, Jian Liang, and Ran He. Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? arXiv preprint arXiv:2504.10000, 2025.
[261] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models. arXiv preprint arXiv:2302.12173, 2023.
[262] Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. Defending chatgpt against jailbreak attack via self-reminders. Nature Machine Intelligence, 5(12):1486–1496, 2023.
[263] Zhexin Zhang, Junxiao Yang, Pei Ke, Fei Mi, Hongning Wang, and Minlie Huang. Defending large language models against jailbreaking attacks through goal prioritization. In Proc. ACL, 2024.
[264] Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, and Yisen Wang. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387, 2023.
[265] Chen Xiong, Xiangyu Qi, Pin-Yu Chen, and Tsung-Yi Ho. Defensive Prompt Patch: A Robust and Generalizable Defense of Large Language Models against Jailbreak Attacks. In Findings of Proc. ACL, 2025.
[266] Xinyi Zeng, Yuying Shang, Jiawei Chen, Jingyuan Zhang, and Yu Tian. Root defence strategies: Ensuring safety of llm at the decoding level. arXiv preprint arXiv:2410.06809, 2024.
[267] Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, and Radha Poovendran. Safedecoding: Defending against jailbreak attacks via safety-aware decoding. arXiv preprint arXiv:2402.08983, 2024.
[268] Somnath Banerjee, Sayan Layek, Soham Tripathy, Shanu Kumar, Animesh Mukherjee, and Rima Hazra. Safeinfer: Context adaptive decoding time safety alignment for large language models. In Proc. AAAI, pages 27188–27196, 2025.
[269] Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, Changshun Wu, Gaojie Jin, Yi Qi, Jinwei Hu, Jie Meng, et al. Safeguarding large language models: A survey. arXiv preprint arXiv:2406.02622, 2024.
[270] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023.
[271] Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pasupuleti. Llama guard 3 vision: Safeguarding human-ai image understanding conversations. arXiv preprint arXiv:2411.10414, 2024.
[272] llama Team. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/, 2024. Accessed: 2024-09-25.
[273] Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Xiang-Bo Mao, Sitaram Asur, et al. A comprehensive survey of LLM alignment techniques: RLHF, RLAIF, PPO, DPO and more. arXiv preprint arXiv:2407.16216, 2024.
[274] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Proc. NeurIPS, 2022.
[275] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
[276] Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback. In Proc. ICML, 2024.
[277] Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. In Proc. NeurIPS, 2024.
[278] Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, and Yaodong Yang. PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference. In Proc. ACL, 2025.
[279] Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, and Timothy Hospedales. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. In Proc. ICML, 2024.
[280] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In Proc. ICLR, 2022.
[281] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. In Proc. NeurIPS, 2024.
[282] Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, and Jun Zhu. STAIR: Improving Safety Alignment with Introspective Reasoning. arXiv preprint arXiv:2502.02384, 2025.
[283] Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. Salad-bench: A hierarchical and comprehensive safety benchmark for large language models. In Findings of Proc. ACL, 2024.
[284] Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707, 2023.
[285] Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation. In Findings of Proc. EMNLP, 2023.
[286] Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, et al. Sorry-bench: Systematically evaluating large language model safety refusal behaviors. In Proc. ICLR, 2025.
[287] Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. In Proc. NAACL, 2024.
[288] Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks. In Proc. COLM, 2024.
[289] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. In Proc. NeurIPS, 2023.
[290] Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A Hale, and Paul Röttger. Simplesafetytests: a test suite for identifying critical safety risks in large language models. arXiv preprint arXiv:2311.08370, 2023.
[291] Mazeika Mantas, Zou Andy, Mu Norman, Phan Long, Wang Zifan, Yu Chunru, Khoja Adam, Jiang Fengqing, O’Gara Aidan, Sakhaee Ellie, Xiang Zhen, Rajabi Arezoo, Hendrycks Dan, Poovendran Radha, Li Bo, and Forsyth David. Tdc 2023 (llm edition): The trojan detection challenge. In Proc. NeurIPS Competition Track, 2023.
[292] Simone Tedeschi, Felix Friedrich, Patrick Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, and Bo Li. ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming. arXiv preprint arXiv:2404.08676, 2024.
[293] Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proc. CVPR, pages 13807–13816, 2024.
[294] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. In Findings of Proc. ACL, 2024.
[295] Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong, and Qi Liu. VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment. In Proc. EMNLP, 2024.
[296] Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Conghui Zhang, Jiahao Li, Donghai Hong, Boyuan Chen, Jiayi Zhou, Kaile Wang, et al. Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models. arXiv preprint arXiv:2503.17682, 2025.
[297] Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, et al. Mm-rlhf: The next step forward in multimodal llm alignment. In Proc. ICML, 2025.
[298] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[299] Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, et al. Mitigating the alignment tax of rlhf. In Proc. EMNLP, 2024.
[300] Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the ninth workshop on statistical machine translation, pages 12–58, 2014.
[301] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822, 2018.
[302] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
[303] Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 35(1):5–22, 2022.
[304] Jiyang Guan, Zhuozhuo Tu, Ran He, and Dacheng Tao. Few-shot backdoor defense using shapley estimation. In Proc. CVPR, pages 13358–13367, 2022.
[305] Jiyang Guan, Jian Liang, and Ran He. Backdoor defense via test-time detecting and repairing. In Proc. CVPR, pages 24564–24573, 2024.
[306] Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive llms that persist through safety training. arXiv preprint arXiv:2401.05566, 2024.
[307] Jiashu Xu, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, and Muhao Chen. Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. In Proc. NAACL, 2024.
[308] Jiawen Shi, Yixin Liu, Pan Zhou, and Lichao Sun. Badgpt: Exploring security vulnerabilities of chatgpt via backdoor attacks to instructgpt. arXiv preprint arXiv:2304.12298, 2023.
[309] Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, and Yang Liu. Badedit: Backdooring large language models by model editing. In Proc. ICLR, 2024.
[310] Haoran Wang and Kai Shu. Trojan activation attack: Red-teaming large language models using activation steering for safety-alignment. arXiv preprint arXiv:2311.09433, 2023.
[311] Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, and Hongxia Jin. Backdooring instruction-tuned large language models with virtual prompt injection. In Proc. NAACL, 2024.
[312] Cody Clop and Yannick Teglia. Backdoored retrievers for prompt injection attacks on retrieval augmented generation of large language models. arXiv preprint arXiv:2410.14479, 2024.
[313] Shuai Zhao, Jinming Wen, Luu Anh Tuan, Junbo Zhao, and Jie Fu. Prompt as triggers for backdoor attack: Examining the vulnerability in language models. In Proc. EMNLP, 2023.
[314] Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In Proc. USENIX Security, pages 1831–1847, 2024.
[315] Houssem Ben Braiek and Foutse Khomh. Machine learning robustness: A primer. In Trustworthy AI in Medical Imaging, pages 37–71. Elsevier, 2025.
[316] Xuezhi Wang, Haohan Wang, and Diyi Yang. Measure and improve robustness in NLP models: A survey. In Proc. NAACL, 2022.
[317] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. In Proc. ICLR, 2024.
[318] Zhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng, Songyang Gao, Tao Gui, Qi Zhang, and Xuanjing Huang. Self-polish: Enhance reasoning in large language models via problem refinement. In Findings of Proc. EMNLP, 2023.
[319] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proc. CVPR, pages 9556–9567, 2024.
[320] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In Proc. ICLR, 2024.
[321] Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. In Proc. NeurIPS, 2024.
[322] Qingjie Zhang, Han Qiu, Di Wang, Haoting Qian, Yiming Li, Tianwei Zhang, and Minlie Huang. Understanding the Dark Side of LLMs’ Intrinsic Self-Correction. arXiv preprint arXiv:2412.14959, 2024.
[323] Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187, 2024.
[324] Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by writing less. arXiv preprint arXiv:2502.18600, 2025.
[325] Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. CoT-Valve: Length-Compressible Chain-of-Thought Tuning. arXiv preprint arXiv:2502.09601, 2025.
[326] Junjie Yang, Ke Lin, and Xing Yu. Think when you need: Self-adaptive chain-of-thought learning. arXiv preprint arXiv:2504.03234, 2025.
[327] Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, and Wenji Mao. Adaptive Thinking via Mode Policy Optimization for Social Language Agents. arXiv preprint arXiv:2505.02156, 2025.
[328] Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms. arXiv preprint arXiv:2502.12067, 2025.
[329] Tergel Munkhbat, Namgyu Ho, Seo Hyun Kim, Yongjin Yang, Yujin Kim, and Se-Young Yun. Self-training elicits concise reasoning in large language models. arXiv preprint arXiv:2502.20122, 2025.
[330] Ping Yu, Jing Xu, Jason Weston, and Ilia Kulikov. Distilling system 2 into system 1. In NeurIPS Workshop on Sys-2 Reasoning, 2024.
[331] Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, and Jiaxin Huang. Efficient test-time scaling via self-calibration. arXiv preprint arXiv:2503.00031, 2025.
[332] Yiming Wang, Pei Zhang, Siyuan Huang, Baosong Yang, Zhuosheng Zhang, Fei Huang, and Rui Wang. Sampling-efficient test-time scaling: Self-estimating the best-of-n sampling in early decoding. arXiv preprint arXiv:2503.01422, 2025.
[333] Zishun Yu, Tengyu Xu, Di Jin, Karthik Abinav Sankararaman, Yun He, Wenxuan Zhou, Zhouhao Zeng, Eryk Helenowski, Chen Zhu, Sinong Wang, et al. Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization. arXiv preprint arXiv:2501.17974, 2025.
[334] Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, and Yinpeng Dong. Mitigating Overthinking in Large Reasoning Models via Manifold Steering. arXiv preprint arXiv:2505.22411, 2025.
[335] Hannah Cyberey and David Evans. Steering the CensorShip: Uncovering Representation Vectors for LLM “Thought" Control. arXiv preprint arXiv:2504.17130, 2025.
[336] Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, et al. Position: Trustllm: Trustworthiness in large language models. In Proc. ICML, 2024.
[337] Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. A survey on fairness in large language models. arXiv preprint arXiv:2308.10149, 2023.
[338] Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. Bias and fairness in large language models: A survey. Computational Linguistics, 50(3):1097–1179, 2024.
[339] Riccardo Cantini, Alessio Orsino, Massimo Ruggiero, and Domenico Talia. Benchmarking adversarial robustness to bias elicitation in large language models: Scalable automated assessment with llm-as-a-judge. arXiv preprint arXiv:2504.07887, 2025.
[340] Yanbo Wang, Jian Liang, and Ran He. Towards eliminating hard label constraints in gradient inversion attacks. In Proc. ICLR, 2024.
[341] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In Proc. S&P, pages 3–18, 2017.
[342] Zhanke Zhou, Jianing Zhu, Fengfei Yu, Xuan Li, Xiong Peng, Tongliang Liu, and Bo Han. Model inversion attacks: A survey of approaches and countermeasures. arXiv preprint arXiv:2411.10023, 2024.
[343] Le Jiang, Liyan Ma, and Guang Yang. Shadow defense against gradient inversion attack in federated learning. Medical Image Analysis, page 103673, 2025.
[344] Jiyang Guan, Jian Liang, and Ran He. Are you stealing my model? sample correlation for fingerprinting deep neural networks. In Proc. NeurIPS, 2022.
[345] Jiyang Guan, Jian Liang, Yanbo Wang, and Ran He. Sample Correlation for Fingerprinting Deep Face Recognition. International Journal of Computer Vision, 133(4):1912–1926, 2025.
[346] Changyue Jiang, Xudong Pan, Geng Hong, Chenfu Bao, and Min Yang. Rag-thief: Scalable extraction of private data from retrieval-augmented generation applications with agent-based attacks. arXiv preprint arXiv:2411.14110, 2024.
[347] Yuhao Wang, Wenjie Qu, Yanze Jiang, Zichen Liu, Yue Liu, Shengfang Zhai, Yinpeng Dong, and Jiaheng Zhang. Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems through Benign Queries. arXiv preprint arXiv:2505.15420, 2025.
[348] Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, et al. Stealing part of a production language model. In Proc. ICML, 2024.
[349] Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. In Proc. NeurIPS, 2024.
[350] Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms. In Proc. COLM, 2024.
[351] Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. Eight methods to evaluate robust unlearning in llms. arXiv preprint arXiv:2402.16835, 2024.
[352] Shengyuan Hu, Yiwei Fu, Steven Wu, and Virginia Smith. Jogging the memory of unlearned llms through targeted relearning attacks. In NeurIPS Workshop on Safe Generative AI, 2024.
[353] Zirui Peng, Shaofeng Li, Guoxing Chen, Cheng Zhang, Haojin Zhu, and Minhui Xue. Fingerprinting deep neural networks globally via universal adversarial perturbations. In Proc. CVPR, pages 13430–13439, 2022.
[354] Si Wang and Chip-Hong Chang. Fingerprinting deep neural networks-a deepfool approach. In Proc. ISCAS, pages 1–5, 2021.
[355] Jiyang Guan, Jian Liang, Yanbo Wang, and Ran He. Sample Correlation for Fingerprinting Deep Face Recognition. International Journal of Computer Vision, pages 1–15, 2024.
[356] Pierre Fernandez, Guillaume Couairon, Hervé Jégou, Matthijs Douze, and Teddy Furon. The stable signature: Rooting watermarks in latent diffusion models. In Proc. ICCV, pages 22466–22477, 2023.
[357] Yuanchun Li, Ziqi Zhang, Bingyan Liu, Ziyue Yang, and Yunxin Liu. ModelDiff: Testing-based DNN similarity comparison for model reuse detection. In Proc. ISSTA, pages 139–151, 2021.
[358] Shaopeng Fu, Fengxiang He, Yang Liu, Li Shen, and Dacheng Tao. Robust unlearnable examples: Protecting data against adversarial learning. In Proc. ICLR, 2022.
[359] Pedro Sandoval-Segura, Vasu Singla, Jonas Geiping, Micah Goldblum, Tom Goldstein, and David Jacobs. Autoregressive perturbations for data poisoning. In Proc. NeurIPS, 2022.
[360] Hanxun Huang, Xingjun Ma, Sarah Monazam Erfani, James Bailey, and Yisen Wang. Unlearnable examples: Making personal data unexploitable. In Proc. ICLR, 2021.
[361] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In Proc. ICML, pages 17061–17084, 2023.
[362] Robin Staab, Mark Vero, Mislav Balunović, and Martin Vechev. Beyond memorization: Violating privacy via inference with large language models. In Proc. ICLR, 2024.
[363] Batuhan Tömekçe, Mark Vero, Robin Staab, and Martin Vechev. Private Attribute Inference from Images with Vision-Language Models. In Proc. NeurIPS, 2024.
[364] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161, 2025.
[365] Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models. arXiv preprint arXiv:2504.11468, 2025.

A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models《基于大型语言模型的推理可信度综合调查》

Abstract 摘要

1 Introduction 1 引言

2 Background 2 背景

2.1 Large Language Model Reasoning2.1 大型语言模型推理

2.2 Chain-of-Thought Prompting2.2Chain-of-Thought 提示

2.3 Large Reasoning Models2.3 大型推理模型

2.3.1 Model Training 2.3.1 模型训练

2.3.2 Multimodal LRM 2.3.2 多模态 LRM

3 Truthfulness 3 真确性

3.1 Hallucination 3.1 幻觉

3.1.1 Hallucination with Reasoning Techniques3.1.1Hallucination 使用推理技术

3.1.2 Hallucination in Reasoning Models3.1.2Hallucination 在推理模型中

3.2 Faithfulness of Reasoning Models3.2 推理模型的忠实性

3.2.1 Faithfulness Measuring3.2.1 忠实性度量

3.2.2 Faithfulness Understanding3.2.2 忠实性理解

3.2.3 Faithfulness Improvement3.2.3 忠实性改进

3.2.4 Further Discussion of Faithfulness Definition3.2.4 对忠实度定义的进一步讨论