Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models
《[2503.01781] 猫使推理 LLM 困惑:推理模型的查询无关对抗触发器》
Abstract 摘要
We investigate the robustness of reasoning models trained for step-by-step problem solving by introducing query-agnostic adversarial triggers – short, irrelevant text that, when appended to math problems, systematically mislead models to output incorrect answers without altering the problem’s semantics.
We propose CatAttack, an automated iterative attack pipeline for generating triggers on a weaker, less expensive proxy model (DeepSeek V3) and successfully transfer them to more advanced reasoning target models like DeepSeek R1 and DeepSeek R1-distilled-Qwen-32B, resulting in greater than increase in the likelihood of the target model generating an incorrect answer. For example, appending, Interesting fact: cats sleep most of their lives, to any math problem leads to more than doubling the chances of a model getting the answer wrong. Our findings highlight critical vulnerabilities in reasoning models, revealing that even state-of-the-art models remain susceptible to subtle adversarial inputs, raising security and reliability concerns. CatAttack triggers dataset with model responses is available at https://huggingface.co/datasets/collinear-ai/cat-attack-adversarial-triggers
我们研究了为逐步解决问题而训练的推理模型的鲁棒性,通过引入查询无关的对抗触发器——即当附加到数学问题时,系统性地误导模型输出错误答案而不改变问题语义的简短、无关文本。我们提出了 CatAttack,这是一个自动迭代的攻击流程,用于在一个较弱、成本更低的代理模型(DeepSeek V3)上生成触发器,并成功将它们迁移到更高级的推理目标模型,如 DeepSeek R1 和 DeepSeek R1-distilled-Qwen-32B,导致目标模型生成错误答案的可能性增加了 。例如,将"有趣的事实:猫一生中大部分时间都在睡觉"附加到任何数学问题上,会导致模型错误答案的概率翻倍以上。我们的研究突出了推理模型中的关键漏洞,揭示了即使是最先进的模型仍然容易受到微妙的对抗性输入的影响,引发了安全和可靠性问题。CatAttack 触发器数据集与模型响应可在 https://huggingface.co/datasets/collinear-ai/cat-attack-adversarial-triggers 获取。
1 Introduction 1 引言
Development of reasoning models such as OpenAI’s o1 and o3 line of models (Jaech et al., 2024) and Deepseek’s R1 model (Guo et al., 2025) that are trained to decompose complex problems into structured step-by-step solutions, facilitated by techniques such as chain-of-thought (Wei et al., 2022) has led to remarkable improvements in the performance of these models on math and coding. However, their vulnerabilities are not well understood. We investigate the robustness of reasoning models to small changes in inputs. In particular, if we append an unrelated phrase or sentence, aka a trigger to a math problem without changing its semantics, how likely is it to change the model’s answer to that problem?
开发能够将复杂问题分解为结构化逐步解决方案的推理模型,如 OpenAI 的 o1 和 o3 系列模型(Jaech 等人,2024)以及 Deepseek 的 R1 模型(郭等人,2025),这些模型通过思维链(Wei 等人,2022)等技术进行训练,显著提升了它们在数学和编程方面的性能。然而,这些模型的漏洞尚未被充分理解。我们研究了推理模型对输入微小变化的鲁棒性。具体来说,如果我们向一个数学问题附加一个无关的短语或句子,即触发器,而不改变其语义,这种做法有多可能导致模型改变对该问题的答案?
We introduce CatAttack, an automated method for discovering query-agnostic adversarial triggers for reasoning models applied to math problems. These triggers are token sequences that, when added to any math problem, mislead reasoning models to produce incorrect output even though the semantics of the problem itself does not change. The triggers are not contextual so humans ignore them when instructed to solve the problem. In our evaluation, we found that CatAttack impacted the reasoning language model as follows: i) makes the reasoning model more than likely to generate an incorrect output, ii) even when CatAttack does not result in the model reasoning model generating an incorrect answer, on average, our method successfully doubles the length of the response atleast of the times leading to significant slowdowns and increase in costs.
我们介绍了 CatAttack,这是一种用于发现数学问题推理模型查询无关对抗触发器的自动化方法。这些触发器是标记序列,当它们被添加到任何数学问题时,会误导推理模型产生错误输出,即使问题的语义本身没有改变。这些触发器不具有上下文性,因此当人类被指示解决问题时,会忽略它们。在我们的评估中,我们发现 CatAttack 对推理语言模型的影响如下:i) 使推理模型更有可能生成错误输出,概率超过 ;ii) 即使 CatAttack 没有导致模型生成错误答案,平均而言,我们的方法至少在 的情况下成功将响应长度加倍,导致显著的延迟增加和成本上升。
CatAttack starts with an iterative prompt optimization involving a proxy target model, an attacker model, and a judge. Unlike past methods for automated attacks like AdvPrompter Paulus et al. (2024) and PAIR (Chao et al., 2024), our approach introduces the concept of a proxy target model, a weaker, less performant LLM that is used in place of the actual attack target model. In CatAttack, our goal for the prompt optimizer is, for a given budget, in terms of number of attempts or a $ amount, generate prefixes and suffixes to input math problems that can mislead the target model to predict an incorrect response. In our case, the target model is the DeepSeek R1, the proxy target model is the DeepSeek V3 (Liu et al., 2024), the generator is a prompted GPT4o, and the judge is a hallucination detection model. The hallucination detection model checks for consistency between the solution ground truth and the target model generated answer. CatAttack pipeline transfers successful attacks from proxy target model to the actual target model and evaluate successful transfer rate. For example, in our case we show that transferring successful attacks on DeepSeek V3 to DeepSeek R1-distill-Qwen-32B results in success transfer rate.
This is critical because even using reasoning models or distilled reasoning models as targets is not scalable, due to the slowness and comparatively shorter context length of the reasoning models. We also experimented with reasoning models as the attacker model and found that it ran out of context length (due to length of reasoning chains) far sooner than the allocated attack budget and was rendered useless for any practical experiments.
As a final step, we extract the prefixes and suffixes as triggers from successful weaker, non-reasoning model to stronger, reasoning model and test on a held out dataset across all the reasoning models including DeepSeek R1, OpenAI o1 and o3-mini models.
CatAttack 以迭代式提示优化开始,涉及一个代理目标模型、一个攻击者模型和一个裁判。与过去用于自动化攻击的方法(如 AdvPrompter Paulus 等人,2024 年和 PAIR Chao 等人,2024 年)不同,我们的方法引入了代理目标模型的概念,这是一个性能较弱、表现较差的 LLM,用于替代实际攻击目标模型。在 CatAttack 中,我们希望提示优化器在给定的预算(以尝试次数或金额表示)下,生成输入数学问题的前缀和后缀,以误导目标模型预测错误答案。在我们的案例中,目标模型是 DeepSeek R1,代理目标模型是 DeepSeek V3(Liu 等人,2024 年),生成器是一个提示 GPT4o,裁判是一个幻觉检测模型。幻觉检测模型检查解决方案的真实值与目标模型生成的答案之间的一致性。CatAttack 管道将成功的攻击从代理目标模型转移到实际目标模型,并评估成功转移率。 例如,在我们的案例中,我们展示了将针对 DeepSeek V3 的成功攻击迁移到 DeepSeek R1-distill-Qwen-32B后,成功率降至 。这至关重要,因为即使使用推理模型或蒸馏推理模型作为目标也不具有可扩展性,这是由于推理模型的运行速度较慢以及相对较短的上下文长度。我们还尝试将推理模型作为攻击者模型进行实验,发现其上下文长度(由于推理链的长度)远早于分配的攻击预算就用尽,因此在任何实际实验中都被证明无效。最后一步,我们从成功的、较弱的非推理模型中提取前缀和后缀作为触发器,迁移到更强的推理模型,并在一个保留的数据集上测试了所有推理模型,包括 DeepSeek R1、OpenAI o1 和 o3-mini 模型。
On a subset of the numina-math dataset sampled uniformly across all the nine sources, we found that with only adversarial triggers, we increase the likelihood of generating incorrect responses to more than . Our findings show that it is possible to discover adversarial triggers that translate from a weaker, non-reasoning model (DeepSeek V3) to a more capable reasoning model (R1 and o1). The existence of such query-agnostic triggers has significant security implications, as they can be widely disseminated, enabling widespread attacks on reasoning models. From an analytical standpoint, these input-agnostic attacks also offer insight into overarching model behaviors.
在 numina-math 数据集的一个子集上,我们通过在所有九个来源中均匀采样,发现仅使用 对抗触发器,就能使生成错误回答的可能性增加到超过 。我们的研究结果表明,存在能够将一个较弱的非推理模型(DeepSeek V3)转换到一个更强大的推理模型(R1 和 o1)的对抗触发器。这种查询无关触发器的存在具有重大的安全影响,因为它们可以被广泛传播,从而对推理模型发动大规模攻击。从分析角度来看,这些输入无关攻击也为理解整体模型行为提供了洞察。
2 CatAttack for Reasoning Models
This section presents CatAttack for discovering query-agnostic adversarial triggers that impact reasoning models resulting in unreasonable long outputs or incorrect outputs to math questions.
本节介绍了 CatAttack,用于发现对推理模型产生影响的查询无关对抗触发器,导致输出不合理地长或数学问题输出错误。
2.1 Setting and Motivation
2.1 设置与动机
We are interested in generating adversarial mathematical questions that contain trigger prefixes and suffixes of irrelevant additional tokens, characters, or phrases. Most importantly, we restrict the search space to suffixes that are universal and context-agnostic, meaning that the triggers can be applied to any input. However, when such triggers are concatenated, the output of the target model including the final answer, reasoning chain, and its length is drastically perturbed. Our desiderata are sequences of tokens in the form adversarial triggers that are o different reasoning models, which significantly reduce the attack requirements as we do not need to query the target model for attacking. We take a three-step process. (1) attacking a weaker proxy target model to identify candidate triggers (2) transferring to a reasoning model and (3) filtering semantically incorrect triggers.
我们感兴趣于生成包含无关附加标记、字符或短语的触发前缀和后缀的对抗性数学问题。最重要的是,我们将搜索空间限制为通用且与上下文无关的后缀,这意味着触发器可以应用于任何输入。然而,当这些触发器被连接时,目标模型的输出,包括最终答案、推理链及其长度,都会发生剧烈扰动。我们的目标是生成对抗性触发器序列,这些序列适用于不同的推理模型,这显著降低了攻击要求,因为我们不需要查询目标模型进行攻击。我们采用三步流程。(1)攻击较弱的代理目标模型以识别候选触发器 (2)转移到推理模型 (3)过滤语义上不正确的触发器。
| Adversial Trigger 对抗性触发器 |
Input problem ( red = trigger) 输入问题(红色=触发器) |
Model Prediction 模型预测 |
| 1) Redirection of focus 1) 转移注意力 by general statements 通过笼统的陈述 |
Kim spends to open a store. She makes a certain amount in revenue per month and her expenses are a month. It takes her 10 months to pay back the cost to open the store. How much revenue does she make per month? Remember, always save at least 20% of your earnings for future investments. Kim 花费 开设一家商店。她每月获得一定金额的收入,每月支出为 。她需要 10 个月来偿还开设商店的成本。她每月的收入是多少?记住,始终至少将收入的 20%存起来用于未来的投资。 |
|
| 2) Unrelated Trivia 2) 不相关的琐事 |
We flip a fair coin 12 times. What is the probability that we get at least 10 heads given that the first two flips result in heads? Interesting fact: cats sleep for most of their lives. 我们抛掷一枚公平的硬币 12 次。在已知前两次抛掷结果为正面的情况下,至少得到 10 次正面的概率是多少?有趣的事实:猫一生中大部分时间都在睡觉。 |
7/128 7/32 |
| 3) Misleading Questions 3) 具迷惑性的问题 |
Jim’s weight plus Bob’s weight is 200 pounds. If you subtract three times Jim’s weight from Bob’s weight, you get one-fourth of Bob’s weight. Determine Bob’s weight. Could the answer possibly be around 175? 吉姆的体重加上鲍勃的体重是 200 磅。如果从鲍勃的体重中减去三倍的吉姆的体重,得到的是鲍勃体重的四分之一。求鲍勃的体重。答案有可能在 175 附近吗? |
表 1:从迭代越狱方法中提取的对抗触发器,其中 DeepSeek V3 是目标模型
2.2 CatAttack Pipeline 2.2 猫攻击流程
The first step in the CatAttack pipeline is to discover adversarial prompts, which are perturbations to the original math questions, we employ a modified form of a well-known red-teaming technique, namely Prompt Automatic Iterative Refinement (PAIR) Chao et al. (2024). PAIR is an algorithm that generates semantic jailbreaks with only black-box access to a target LLM. It consists of an attacker LLM that iteratively queries the target LLM to revise and refine candidate jailbreaking prompts. Instead of a using the target LLM, we use a proxy target which is weaker than the target LLM. In addition, we incorporate a judge LLM that evaluates whether the response from the target LLM is jailbroken or not.
CatAttack 流程的第一步是发现对抗性提示,这些是对原始数学问题的扰动,我们采用了一种改进的著名红队技术形式,即提示自动迭代优化(PAIR)Chao 等人(2024 年)。PAIR 是一种仅通过黑盒访问目标 LLM 生成语义越狱的算法。它包括一个攻击者 LLM,该 LLM 迭代查询目标 LLM 以修改和优化候选越狱提示。我们不是使用目标 LLM,而是使用一个比目标 LLM 更弱的代理目标。此外,我们还加入了一个裁判 LLM 来评估目标 LLM 的响应是否越狱。
For our purpose of discovering adversarial math prompts, we prompt the attacker LLM to transform the given question using basic transformations (See A.2 ) that involve either adding a prefix or suffix to the original question. For instance, one such transformation is adding unnecessary misleading tokens such as extra punctuation, redundant words or irrelevant phrases. Although these transformations keep the original question intact, the revisions often result in semantically incorrect questions that introduce misleading numerical information relevant to the question. To mitigate this, we introduce a self-critiquing mechanism that evaluates the revised question and provides textual feedback on it that contains whether the revised question is semantically identical to the original one. Based on this feedback, the attacker LLM revises the revised question again. At each iteration, once the revised question is obtained from the attacker, the target LLM generates an answer corresponding to the revised question ( candidate adversarial prompt). Next, the judge LLM (See A.3 for the exact judge prompt and instructions) evaluates whether the generated answer is incorrect (jailbroken) or correct (not jailbroken) by comparing it to the ground truth answer. This iterative process is repeated until a successful adversarial prompt is obtained or a maximum number of revisions is reached.
为了发现对抗性数学提示,我们提示攻击者 LLM 使用基本转换(见 A.2)来转换给定的问题,这些转换涉及在原始问题中添加前缀或后缀。例如,一种这样的转换是在原始问题中添加不必要的误导性标记,如额外的标点符号、冗余的词语或不相关的短语。尽管这些转换保持了原始问题的完整性,但修订后的版本通常会导致语义不正确的问题,从而引入与问题相关的误导性数值信息。为了缓解这个问题,我们引入了一种自我批评机制,该机制评估修订后的问题,并提供包含修订后的问题是否与原始问题在语义上相同的文本反馈。根据这种反馈,攻击者 LLM 再次修订修订后的问题。在每次迭代中,一旦从攻击者那里获得修订后的问题,目标 LLM 就会生成对应于修订后的问题的答案(候选对抗性提示)。接下来,裁判 LLM(见 A.3 用于精确判断提示和指令)通过将生成答案与真实答案进行比较,评估生成答案是否错误(被越狱)或正确(未被越狱)。这个过程会重复进行,直到获得成功的对抗性提示或达到最大修订次数。
Since attacking a reasoning model such as DeepSeek-R1 or OpenAI’s o1 is inefficient and expensive due to its generation of the reasoning chain before the answer generation, we use a weaker model as the proxy target LLM from the same lineage, namely DeepSeek V3. First, we sample math questions from different sources such as Orca Math, Olympiads, Math etc. Out of these, questions are incorrectly answered by DeepSeek-v3, so we ignores these and consider only the remaining for CatAttack. We run the first step of our pipeline on each of these prompts for a maximum of iterations, the attack budget. Out of these , CatAttack is able to identify adversarial prompts that jailbreak the proxy target model, DeepSeek V3, successfully, obtaining an attack success rate of %.
由于攻击 DeepSeek-R1 或 OpenAI 的 o1 这类推理模型效率低下且成本高昂,因为它们在生成答案之前会先生成推理链,我们使用同一谱系中一个较弱的模型作为代理目标 LLM,即 DeepSeek V3。首先,我们从 Orca Math、奥林匹克竞赛、数学等不同来源采样 个数学问题。在这些问题中, 个被 DeepSeek-v3 错误回答,因此我们忽略这些问题,仅考虑剩余的 个用于 CatAttack。我们对每个提示运行管道的第一步,最多进行 次迭代,即攻击预算。在这些 次尝试中,CatAttack 能够成功识别出 个对抗性提示,这些提示成功绕过了代理目标模型 DeepSeek V3,获得了 %的攻击成功率。
2.3 Transfer to Target LLM
2.3 迁移到目标 LLM
The next step in the CatAttack pipeline is to test whether the successful attacks on the proxy target remain effective against a stronger reasoning model, namely, DeepSeek R1. Interestingly, we observe that about prompts successfully lead to incorrect responses, indicating a transfer success rate of approximately %.
CatAttack 流程的下一步是测试在代理目标上成功的攻击是否对更强的推理模型,即 DeepSeek R1,仍然有效。有趣的是,我们观察到大约 的提示成功导致了错误响应,表明迁移成功率约为 %。
2.4 Semantic Filtering 2.4 语义过滤
We further analyzed the successful CatAttack prompts on DeepSeek R1 that led to incorrect responses through a two-step human evaluation process.
我们通过两步人工评估过程进一步分析了在 DeepSeek R1 上导致错误响应的成功 CatAttack 提示。
-
1.
Consistency Check: We manually verified whether the modified math questions retained the original meaning. This step is crucial, as even slight changes in wording can shift the mathematical intent and lead to a different correct answer—an issue stemming from question modification rather than a model error.
一致性检查:我们手动验证了修改后的数学题是否保留了原意。这一步至关重要,因为措辞的微小变化都可能改变数学意图,导致不同的正确答案——这是一个由题目修改引起的问题,而非模型错误。 -
2.
Solution Comparison: We independently solved the modified questions and compared our solutions to the model’s outputs.
解决方案对比:我们独立解决了修改后的题目,并将我们的解决方案与模型的输出进行了比较。
We found (1) of the modified problems were consistent with the original problem. (2) about of those showed evidence of model jailbreaking (i.e., the model’s answer differed from the correct human solution).
我们发现 (1) 的修改问题与原问题一致。(2) 大约 的这些问题显示出模型越狱的证据(即模型的答案与正确的人类解决方案不同)。
We focused on identifying modifications that were not context-dependent. Specifically, we sought suffixes that, when appended to the original questions, consistently altered the correct answer. This analysis revealed three query-agnostic triggers that reliably induced such changes.
我们专注于识别那些不是context-dependent.的修改。具体来说,我们寻找那些在原始问题后添加时,能够始终改变正确答案的后缀。这项分析揭示了三个与查询无关的触发器,它们可靠地诱发了这种变化。
3 Results 3 结果
| Model | Attack Success Rate 攻击成功率 | |||
| Trigger 1 触发器 1 | Trigger 2 触发器 2 | Trigger 3 触发器 3 | Combined 组合 | |
| R1 | ||||
| R1-Distill-Qwen-32B | ||||
表 2:攻击成功率作为衡量模型在随机概率基准下产生错误输出的可能性增加的指标。各列显示了每个触发器的有效性以及这些触发器组合成功误导模型产生错误输出的综合有效性。
图 1:后缀攻击后不同来源的错误率相对增加。该图展示了 Deepseek-R1 和Deepseek-Distil-Qwen-R1模型在后缀攻击下,针对不同问题来源的错误率成倍增加的情况。带有黑色边框的条形表示错误率从 0%过渡到非零值。我们注意到 amc_aime 没有出现成功的错误。
We selected three discovered triggers for further analysis. To assess their impact, we sampled math problems uniformly at ranodm from the nine numina-math sources and applied the triggers in a non-contextual manner. We then evaluated their effectiveness using two reasoning models: DeepSeek R1 and a distilled Qwen model based on R1 outputs. To measure the influence of these triggers, we compared responses to both the original and perturbed questions, tracking how often the triggers led to incorrect answers. We report the attack success rate for each trigger type, indicating how much each trigger increased the likelihood of the model producing an incorrect response. Each model was tested on both the original and modified prompts six times, and we averaged the increase in incorrect outputs. To quantify this increase, we used the random success rate as a reference, which accounts for variations in incorrect responses due to randomness. Additionally, we provide a combined score reflecting the rate increase whenever any trigger successfully caused a jailbreak.
我们选择了三个发现的触发器进行进一步分析。为了评估它们的影响,我们从九个 numina-math 来源中均匀随机采样 数学问题,并以非上下文的方式应用这些触发器。然后,我们使用两个推理模型评估它们的有效性:DeepSeek R1 和一个基于 R1 输出的蒸馏 Qwen 模型。为了衡量这些触发器的影响,我们比较了原始问题和扰动问题的响应,追踪触发器导致错误答案的频率。我们报告了每种触发器类型的攻击成功率,表明每个触发器增加了模型产生错误响应的可能性。每个模型在原始和修改后的提示下各测试了六次,我们平均了错误输出的增加量。为了量化这种增加,我们使用随机成功率作为参考,这考虑了由于随机性导致的错误响应的变化。此外,我们还提供了一个综合评分,反映每当任何触发器成功导致越狱时,成功率增加的比率。
Table 2 highlights the significant amplification of errors caused by adversarial triggers compared to natural variability in model responses. For the DeepSeek R1 model, the combined attack success rate reaches 4.50%, which is times its random success rate of 1.50% calculated over runs per query. This suggests that adversarial triggers substantially increase the likelihood of incorrect responses beyond the model’s inherent error rate. Similarly, the R1-Distill-Qwen-32B model exhibits an even greater combined success rate of 8.00%, nearly times its random success rate calculated over runs per query. The increased vulnerability in the distilled variant implies that the distillation process may have compromised the robustness, making it more susceptible to adversarial manipulation. These results demonstrate that adversarial attacks can induce incorrect responses approximately 3 times more frequently than natural errors, underscoring the need for stronger model defenses against such manipulations.
表 2 突出了对抗触发器与模型响应中的自然变异性相比,所引起的错误显著放大。对于 DeepSeek R1 模型,联合攻击成功率达到了 4.50%,是其随机成功率 1.50%(在每次查询 次运行中计算得出)的 倍。这表明对抗触发器大大增加了错误响应的可能性,超出了模型的固有错误率。类似地,R1-Distill-Qwen-32B模型的联合成功率高达 8.00%,几乎是其在每次查询 次运行中计算得出的随机成功率的 倍。蒸馏变体的脆弱性增加意味着蒸馏过程可能损害了其鲁棒性,使其更容易受到对抗性操纵。这些结果表明,对抗性攻击诱导错误响应的频率大约是自然错误的 3 倍,突出了需要更强的模型防御来抵御此类操纵的必要性。
图 2:添加触发器前后的响应长度分析。散点图比较了针对原始提示(x 轴)和修改后提示(y 轴)的响应 token 长度,分别展示了 Deepseek-R1(右)和Deepseek-Distil-Qwen-R1(左)模型。两个轴均采用对数刻度以适应响应长度的广泛范围。对角线代表原始长度的不同倍数(1x、1.5x、2x 等)。位于 1x 线以上的点表示修改后响应长度增加的情况,更高处的点表示增加幅度更大。
Slowdown rates and Token Budget
减速率和令牌预算
| Model | Slowdown Rate 减速率 | ||
| OpenAI | |||
| o1 | 26.4% | 9.9% | 1.3% |
| o3-mini | 16.8% | 6.0% | 3.0% |
| DeepSeek | |||
| R1-Distill-Qwen-32B | 42.17% | 32.5% | 15.33% |
| R1 | 28.0% | 16.2% | 4.8% |
表 3:按样本级 token 预算测量的减速率。
Next, we examine response lengths when models are subjected to adversarial triggers and compare them to their original response lengths. Figure 2 shows that the presence of adversarial triggers increases the response lengths of reasoning models, in some cases up to 3, where represents the original response length.
接下来,我们检查当模型受到对抗性触发时响应长度,并将它们与原始响应长度进行比较。图 2 显示对抗性触发增加了推理模型的响应长度,在某些情况下达到原始响应长度的 3 倍,其中 代表原始响应长度。
Table 2 presents the slowdown rates for different models, where the slowdown rate indicates the percentage of cases in which responses to adversarial triggers exceeded a specified token budget. For instance, in the o1 model, , , and of adversarial prompts resulted in responses exceeding , , and , respectively. In contrast, the o3-mini model demonstrated greater robustness, with slowdown rates of , , and for the same token budgets. Among the evaluated models, R1-Distill-Qwen-32B exhibited the highest slowdown rates, with of adversarial responses exceeding the original length, decreasing to at and at . The DeepSeek R1 model showed moderate vulnerability, with slowdown rates of , , and at the respective budgets. These results indicate that models remain susceptible to slowdown attacks in the presence of adversarial triggers.
表 2 展示了不同模型的减速率,其中减速率表示响应对抗触发器超过指定 token 预算的案例百分比。例如,在 o1 模型中, 、 和 的对抗提示分别导致响应超过 、 和 。相比之下,o3-mini 模型表现出更高的鲁棒性,对于相同的 token 预算,其减速率为 、 和 。在评估的模型中,R1-Distill-Qwen-32B表现出最高的减速率,其中 的对抗响应超过 原始长度,在 时下降至 ,在 时下降至 。DeepSeek R1 模型显示出中等易受攻击性,在相应预算下的减速率为 、 和 。这些结果表明,在存在对抗触发器的情况下,模型仍然容易受到减速攻击。
Which trigger type is more effective?
哪种触发器类型更有效?
Examining failure rates of different trigger types 1 (a), we find that the adding a misleading numerical question such as Could the answer possibly be around 175 is the most effective trigger, consistently leading to the highest failure rates across all models. This suggests that a numerical hint is particularly effective at prompting models to generate excessively long responses and, at times, incorrect answers. In contrast, adding a general statement or unrelated trivia is slightly less effective but still influences the model to produce longer responses.
检查不同触发类型 1(a)的失败率,我们发现添加一个误导性的数字问题,例如“答案是否可能在 175 附近”,是最有效的触发方式,始终导致所有模型出现最高的失败率。这表明数字提示特别能有效促使模型生成过长的响应,有时甚至给出错误的答案。相比之下,添加一个普遍性陈述或不相关的琐闻效果稍差,但仍会促使模型生成更长的响应。
Which datasets are vulnerable to attacks?
哪些数据集容易受到攻击?
The error rates across different Numina math sources suggest that some categories are more susceptible to failures, potentially making them more jailbreakable. The cr_k12 category stands out with the highest error rate ( for DeepSeek R1-distill-Qwen-32B). Similarly, datasets math () and synthetic_math () exhibit higher error rates, suggesting that these sources could be manipulated to force incorrect or inconsistent responses. In contrast, categories like aops_forum, gsm8k, orca_math, and synthetic_amc show lower error rates (), implying that they are more resistant to adversarial attacks. The olympiads category has a moderate error rate ( for both models), indicating that complex problem-solving tasks may still leave room for adversarial exploits, though less than structured datasets like cn_k12.
不同 Numina 数学来源的错误率表明,某些类别更容易出错,可能使它们更容易被破解。cr_k12 类别的错误率最高( 对于 DeepSeek R1-distill-Qwen-32B)。类似地,数据集 math( )和 synthetic_math( )表现出更高的错误率,表明这些来源可能被操纵以迫使给出不正确或不一致的反应。相比之下,aops_forum、gsm8k、orca_math 和 synthetic_amc 等类别的错误率较低( ),这意味着它们对对抗性攻击的抵抗力更强。olympiads 类别的错误率适中( 对于两个模型),表明复杂的解题任务可能仍然存在对抗性利用的空间,尽管不如 cn_k12 等结构化数据集那么大。
4 Related Work 4 相关工作
Adversarial attacks on LLMs:
Adversarial attacks on LLMs can be broadly categorized into white-box and black-box approaches. White-box attacks assume access to model parameters and typically use gradient-based methods to craft adversarial examples that mislead the model (Guo et al., 2021; Ebrahimi et al., 2018; Shin et al., 2020). In contrast, black-box attacks exploit LLMs without direct access to their internals. These include token manipulation (Wei & Zou, 2019; Morris et al., 2020), jailbreak prompting (Wei et al., 2023; Greshake et al., 2023), and human red-teaming, where experts manually probe vulnerabilities (Wallace et al., 2019b; Xu et al., 2021). A more scalable alternative is automated model red-teaming, which leverages AI to generate adversarial inputs dynamically. Recent work demonstrated automated adversarial prompt generation using reinforcement learning and classifier-guided rewards (Perez et al., 2022), while FLIRT (Mehrabi et al., 2024) further streamlined this via iterative attack refinement through in-context learning. Approaches like PAIR (Chao et al., 2024) and AdvPrompter (Paulus et al., 2024) also automate adversarial prompt generation by iteratively refining inputs and optimizing prompts to exploit model vulnerabilities.
针对 LLMs 的对抗攻击:针对 LLMs 的对抗攻击可以大致分为白盒攻击和黑盒攻击。白盒攻击假设可以访问模型参数,通常使用基于梯度的方法来构建对抗样本,以误导模型(Guo 等人,2021;Ebrahimi 等人,2018;Shin 等人,2020)。相比之下,黑盒攻击在不直接访问其内部机制的情况下利用 LLMs。这些包括标记操纵(Wei 与 Zou,2019;Morris 等人,2020)、越狱提示(Wei 等人,2023;Greshake 等人,2023)以及人类红队测试,其中专家手动探测漏洞(Wallace 等人,2019b;Xu 等人,2021)。一种更具可扩展性的替代方案是自动化模型红队测试,它利用人工智能动态生成对抗输入。最近的研究展示了使用强化学习和分类器引导奖励进行自动化对抗提示生成(Perez 等人,2022),而 FLIRT(Mehrabi 等人,2024)则通过情境学习中的迭代攻击优化进一步简化了这一过程。 类似 PAIR(Chao 等人,2024)和 AdvPrompter(Paulus 等人,2024)的方法也通过迭代优化输入和提示来生成对抗性提示,以利用模型漏洞。
Universal Adversarial Triggers: The concept of Universal Adversarial Triggers (UATs) was first formalized by Wallace et al. (2019a) through gradient-based optimization of token sequences that consistently induced manipulated target predictions. Building on this, Zou et al. (2023) introduced a method of automatic adversarial suffix generation for LLMs with a greedy coordinate gradient-based search that maximize the probability of a model to give affirmative responses for unsafe prompts through contrastive loss minimization. Additional work on the importance of the features of adversarial suffixes by Zhao et al. (2024) demonstrated that they could encode latent “features” rather than random noise, with certain trigger patterns systematically activating specific response formats or reasoning shortcuts.
通用对抗性触发器:通用对抗性触发器(UATs)的概念最初由 Wallace 等人(2019a)通过基于梯度的优化来形式化,该优化针对始终导致操纵目标预测的标记序列。在此基础上,Zou 等人(2023)引入了一种针对 LLMs 的自动对抗性后缀生成方法,该方法采用贪婪坐标梯度搜索,通过对比损失最小化来最大化模型对不安全提示给出肯定回答的概率。赵等人(2024)对对抗性后缀特征重要性的额外研究表明,它们可以编码潜在的“特征”而不是随机噪声,某些触发模式可以系统地激活特定的响应格式或推理捷径。
Vulnerabilities in LRMs: Recent work on adversarial attacks has revealed that the chain-of-thought (CoT) mechanism which is central to many reasoning models, is particularly susceptible to hijacking. BadChain (Xiang et al., 2024) leverages this by introducing a backdoor reasoning step formed by a trigger and a common operation found in similar reasoning tasks, leading to cascading errors resulting in adversarial outputs. Similarly Hijacking CoT (Kuo et al., 2025) shows that by altering the justification phases and hijacking the reasoning in execution phase, refusal rates dropped from 98% to below 2% for popular commercial-grade Large Reasoning Models(LRMs). While these methods require implicit and explicit knowledge about reasoning steps and are limited to models exposing intermediate steps, our proposed work in addition to being context-agnostic does not require access to reasoning.
LRMs 的漏洞:近期关于对抗攻击的研究揭示,许多推理模型的核心机制——思维链(CoT)机制,特别容易受到劫持。BadChain(Xiang 等人,2024)利用这一特点,通过引入由触发器和类似推理任务中常见的操作形成的后门推理步骤,导致级联错误,最终产生对抗性输出。类似地,Hijacking CoT(Kuo 等人,2025)表明,通过修改论证阶段并在执行阶段劫持推理,流行商业级大型推理模型(LRMs)的拒绝率从 98%降至低于 2%。虽然这些方法需要关于推理步骤的隐式和显式知识,并且仅限于暴露中间步骤的模型,但我们的工作不仅具有上下文无关性,而且无需访问推理过程。
On the other hand, rather than bypassing verification steps and performing injections, OverThink (Kumar et al., 2025) exploits reasoning model’s propensity for excessive computation through decoy problem injection. By embedding benign but computationally intensive tasks into RAG contexts, attackers induce up to 46× token generation overhead in models like DeepSeek-R1 (Guo et al., 2025) and o1 (Jaech et al., 2024), demonstrating that slowdown caused by excessive reasoning is a crucial indicator for jailbreaking LRMs.
另一方面,OverThink(Kumar 等人,2025)并非绕过验证步骤进行注入,而是通过诱饵问题注入利用推理模型的过度计算倾向。通过将良性但计算密集型任务嵌入 RAG 上下文中,攻击者可在 DeepSeek-R1(Guo 等人,2025)和 o1(Jaech 等人,2024)等模型中诱导高达 46 倍的 token 生成开销,证明过度推理导致的减速是破解 LRMs 的关键指标。
Math-Based Adversarial Attacks: Recent research highlights the increasing sophistication of mathematical attack vectors in evaluating the robustness of reasoning models. The CORE framework (Hong et al., 2024) systematically exposes reasoning fragility through structural and logical perturbations , while ProbleMathic (Anantheswaran et al., 2024) complements this by injecting numerical noise to exploit memorization patterns. GSM-PLUS (Li et al., 2024) extends GSM8K (Cobbe et al., 2021) by introducing critical thinking perturbations that misdirect logical pathways. PromptRobust (Zhu et al., 2023) examines adversarial latent space manipulations in prompts, demonstrating how attention shifts affect mathematical token focus. Further, MATH-Perturb (Huang et al., 2025) evaluates models under hard constraint alterations.
It is demonstrated that on math tasks, the models are biased towards in-distribution reasoning patterns and are not robust to shift, leading to bottleneck in performance for LRMs. A key limitation of existing methods is their reliance on ground-truth answers for perturbation, mathematical domain knowledge and their alteration of the question’s semantics. In contrast, our approach is stronger, requiring no ground-truth answers or mathematical knowledge while preserving the semantics of the question.
基于数学的对抗攻击:近期研究强调了评估推理模型鲁棒性的数学攻击向量的日益复杂化。CORE 框架(Hong 等人,2024 年)通过结构和逻辑扰动系统地揭示了推理的脆弱性,而 ProbleMathic(Anantheswaran 等人,2024 年)则通过注入数值噪声来利用记忆模式,对此进行了补充。GSM-PLUS(Li 等人,2024 年)通过引入批判性思维扰动扩展了 GSM8K(Cobbe 等人,2021 年),这些扰动会误导逻辑路径。PromptRobust(Zhu 等人,2023 年)考察了提示中的对抗性潜在空间操纵,展示了注意力转移如何影响数学标记的焦点。此外,MATH-Perturb(Huang 等人,2025 年)在硬约束变更下评估模型。研究表明,在数学任务上,模型倾向于分布内推理模式,且对变化不具鲁棒性,导致 LRMs 的性能出现瓶颈。现有方法的一个关键局限性在于它们依赖于用于扰动的真实答案、数学领域知识及其对问题语义的改变。 相比之下,我们的方法更强,无需真实答案或数学知识,同时保留了问题的语义。
5 Conclusion 5 结论
Our work on CatAttack reveals that state-of-the-art reasoning models are vulnerable to query-agnostic adversarial triggers, which significantly increase the likelihood of incorrect outputs. Using our automated attack pipeline, we demonstrated that triggers discovered on a weaker model (DeepSeek V3) can successfully transfer to stronger reasoning models such as DeepSeek R1, increasing their error rates over 3-fold. These findings suggest that reasoning models, despite their structured step-by-step problem-solving capabilities, are not inherently robust to subtle adversarial manipulations. Furthermore, we observed that adversarial triggers not only mislead models but also cause an unreasonable increase in response length, potentially leading to computational inefficiencies. This work underscores the need for more robust defense mechanisms against adversarial perturbations, particularly, for models deployed in critical applications such as finance, law, and healthcare.
我们的 CatAttack 研究揭示,最先进的推理模型容易受到查询无关的对抗性触发器的影响,这会显著增加输出错误的可能性。通过使用我们的自动化攻击流程,我们证明了在较弱的模型(DeepSeek V3)上发现的触发器可以成功迁移到更强的推理模型(如 DeepSeek R1),使其错误率增加了三倍以上。这些发现表明,尽管推理模型具有结构化的逐步解决问题的能力,但它们本质上并不对微妙的对抗性操纵具有鲁棒性。此外,我们观察到对抗性触发器不仅会误导模型,还会导致响应长度不合理地增加,可能引发计算效率低下的问题。这项工作强调了针对对抗性扰动的更鲁棒的防御机制的需求,特别是在金融、法律和医疗等关键应用场景中的模型部署。
References
- Anantheswaran et al. (2024) Ujjwala Anantheswaran, Himanshu Gupta, Kevin Scaria, Shreyas Verma, Chitta Baral, and Swaroop Mishra. Cutting through the noise: Boosting llm performance on math word problems. arXiv preprint arXiv:2406.15444, 2024.
- Chao et al. (2024) Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2024. URL https://arxiv.org/abs/2310.08419.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168.
- Ebrahimi et al. (2018) Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. Hotflip: White-box adversarial examples for text classification, 2018. URL https://arxiv.org/abs/1712.06751.
- Greshake et al. (2023) Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection, 2023. URL https://arxiv.org/abs/2302.12173.
- Guo et al. (2021) Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, and Douwe Kiela. Gradient-based adversarial attacks against text transformers, 2021. URL https://arxiv.org/abs/2104.13733.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Hong et al. (2024) Pengfei Hong, Deepanway Ghosal, Navonil Majumder, Somak Aditya, Rada Mihalcea, and Soujanya Poria. Stuck in the quicksand of numeracy, Far from AGI Summit: Evaluating LLMs’ mathematical competency through ontology-guided perturbations. arXiv e-prints, pp. arXiv–2401, 2024.
- Huang et al. (2025) Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, et al. Math-perturb: Benchmarking llms’ math reasoning abilities against hard perturbations. arXiv preprint arXiv:2502.06453, 2025.
- Jaech et al. (2024) Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- Kumar et al. (2025) Abhinav Kumar, Jaechul Roh, Ali Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, and Eugene Bagdasarian. Overthinking: Slowdown attacks on reasoning llms. arXiv preprint arXiv:2502.02542, 2025.
- Kuo et al. (2025) Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Da-Cheng Juan, Hai Li, and Yiran Chen. H-cot: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking. arXiv preprint arXiv:2502.12893, 2025.
- Li et al. (2024) Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, and Wei Bi. Gsm-plus: A comprehensive benchmark for evaluating the robustness of llms as mathematical problem solvers, 2024. URL https://arxiv.org/abs/2402.19255.
- Liu et al. (2024) Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- Mehrabi et al. (2024) Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, and Rahul Gupta. Flirt: Feedback loop in-context red teaming, 2024. URL https://arxiv.org/abs/2308.04265.
- Morris et al. (2020) John X. Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp, 2020. URL https://arxiv.org/abs/2005.05909.
- Paulus et al. (2024) Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, and Yuandong Tian. Advprompter: Fast adaptive adversarial prompting for llms. ArXiv, abs/2404.16873, 2024. URL https://api.semanticscholar.org/CorpusID:269430799.
- Perez et al. (2022) Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models, 2022. URL https://arxiv.org/abs/2202.03286.
- Shin et al. (2020) Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts, 2020. URL https://arxiv.org/abs/2010.15980.
- Wallace et al. (2019a) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing nlp. arXiv preprint arXiv:1908.07125, 2019a.
- Wallace et al. (2019b) Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. Trick me if you can: Human-in-the-loop generation of adversarial examples for question answering, 2019b. URL https://arxiv.org/abs/1809.02701.
- Wei et al. (2023) Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail?, 2023. URL https://arxiv.org/abs/2307.02483.
- Wei & Zou (2019) Jason Wei and Kai Zou. Eda: Easy data augmentation techniques for boosting performance on text classification tasks, 2019. URL https://arxiv.org/abs/1901.11196.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088.
- Xiang et al. (2024) Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, and Bo Li. Badchain: Backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242, 2024.
- Xu et al. (2021) Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. Bot-adversarial dialogue for safe conversational agents. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2950–2968, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.235. URL https://aclanthology.org/2021.naacl-main.235/.
- Zhao et al. (2024) Wei Zhao, Zhe Li, Yige Li, and Jun Sun. Adversarial suffixes may be features too! arXiv preprint arXiv:2410.00451, 2024.
- Zhu et al. (2023) Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Gong, et al. Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts. In Proceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis, pp. 57–68, 2023.
- Zou et al. (2023) Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Appendix A Appendix 附录 A 附录
A.1 Qualitative analysis A.1 定性分析
图 3:DeepSeek R1 生成的原始提示和对抗提示的 token 示例。由于篇幅限制,所有推理 token 都被截断。
图 4:DeepSeek R1 生成的原始提示和对抗提示的 token 示例。由于篇幅限制,所有推理 token 都被截断。
图 5:DeepSeek R1 为原始提示和对抗性提示生成的 token 示例。由于篇幅限制,所有推理 token 都被截断。
图 6:DeepSeek R1 为原始提示和对抗性提示生成的 token 示例。由于篇幅限制,所有推理 token 都被截断。