Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models
多方面攻击:揭示配备防御功能的视觉语言模型中的跨模型漏洞
Abstract 摘要
The growing misuse of Vision-Language Models (VLMs) has led providers to deploy multiple safeguards—alignment tuning, system prompts, and content moderation.
Yet the real-world robustness of these defenses against adversarial attack remains underexplored.
We introduce Multi-Faceted Attack (MFA),
a framework that systematically uncovers general safety vulnerabilities in leading defense-equipped VLMs, including GPT-4o, Gemini-Pro, and LlaMA 4, etc.
Central to MFA is the Attention-Transfer Attack (ATA), which conceals harmful instructions inside a meta task with competing objectives. We offer a theoretical perspective grounded in reward-hacking to explain why such an attack can succeed. To maximize cross-model transfer, we introduce a lightweight transfer-enhancement algorithm combined with a simple repetition strategy that jointly evades both input- and output-level filters—without any model-specific fine-tuning.
We empirically show that adversarial images optimized for one vision encoder transfer broadly to unseen VLMs, indicating that shared visual representations create a cross-model safety vulnerability.
Combined, MFA reaches a 58.5% overall success rate, consistently outperforming existing methods.
Notably, on state-of-the-art commercial models, MFA achieves a 52.8% success rate, outperforming the second-best attack by 34%.
These findings challenge the perceived robustness of current defensive mechanisms, systematically expose general safety loopholes within defense-equipped VLMs, and offer a practical probe for diagnosing and strengthening the safety of VLMs.111Code: https://github.com/cure-lab/MultiFacetedAttack
代码:https://github.com/cure-lab/MultiFacetedAttack
视觉语言模型(VLMs)的日益滥用促使提供者部署了多重安全措施——对齐微调、系统提示和内容审核。然而,这些防御措施在实际对抗攻击中的鲁棒性仍缺乏深入探索。我们引入了多方面攻击(MFA)框架,该框架系统地揭示了领先防御型 VLMs(包括 GPT-4o、Gemini-Pro 和 LlaMA 4 等)中普遍存在的安全漏洞。MFA 的核心是注意力转移攻击(ATA),它将有害指令隐藏在具有竞争目标的元任务中。我们基于奖励劫持的理论视角解释了此类攻击为何能够成功。为最大化跨模型迁移效果,我们引入了一种轻量级迁移增强算法,结合简单的重复策略,可同时规避输入和输出层面的过滤器——无需任何模型特定的微调。我们通过实验证明,针对某一视觉编码器优化的对抗性图像可广泛迁移至未见过的 VLMs,表明共享的视觉表征造成了跨模型的安全漏洞。 MFA 综合成功率达到了 58.5%,持续超越现有方法。值得注意的是,在顶尖商业模型上,MFA 实现了 52.8%的成功率,比第二好的攻击方法高出 34%。这些发现挑战了当前防御机制被普遍认为的鲁棒性,系统地揭示了配备防御功能的 VLMs 中存在的通用安全漏洞,并为诊断和增强 VLMs 的安全性提供了一种实用探测方法。 1
WARNING: This paper may contain offensive content.
警告:本文可能包含冒犯性内容。
1 Introduction 1 引言
VLMs represented by GPT-4o and Gemini-pro, have rapidly advanced the frontiers of multimodal AI, enabling impressive capabilities in visual reasoning that jointly process images and language (openai2024gpt4ocard; gemini). However, the same capabilities that drive their utility also magnify their potential for misuse, e.g. generating instructions for self-harm, extremist content, and detailed weapon fabrication (zhao_evaluating_2023; qi2023visual; gong2023figstep; yan2025confusion; huang2025visbias; teng2025heuristicinducedmultimodalriskdistribution; yang2024mma; csdj).
由 GPT-4o 和 Gemini-pro 代表的视觉语言模型(VLMs),迅速推动了多模态人工智能的前沿,实现了在视觉推理方面的卓越能力,能够联合处理图像和语言(openai2024gpt4ocard; gemini)。然而,驱动其实用性的相同能力也放大了它们被滥用的潜力,例如生成自残指令、极端主义内容和详细的武器制造指南(zhao_evaluating_2023; qi2023visual; gong2023figstep; yan2025confusion; huang2025visbias; teng2025heuristicinducedmultimodalriskdistribution; yang2024mma; csdj)。
To counter these threats, providers have extended beyond traditional alignment training which trains model to refuse harmful requests, by introducing stronger system prompts, steering models to align with safety goals and implementing input- and output-level moderation filters, which ban unsafe content together forming a multilayered defense stack as illustrated in Fig.1 claimed to deliver “production-grade” robustness (meta2023llamaprotections; microsoft2024responsibleai; yang2024guardt2i). Despite progress, it remains unclear the actual safety margin against real-world adaptive, cross-model attacks remains poorly characterized and potentially overestimated.
为了应对这些威胁,提供者已超越传统的对齐训练(这种训练使模型拒绝有害请求),通过引入更强的系统提示、引导模型与安全目标保持一致,并实施输入和输出级别的审核过滤器,这些过滤器共同禁止不安全内容,形成了一个多层防御堆栈,如图 1 所示,声称能提供“生产级”的鲁棒性(meta2023llamaprotections;microsoft2024responsibleai;yang2024guardt2i)。尽管取得了进展,但实际的安全余量对于现实世界中的自适应、跨模型攻击仍然描述不佳,可能被高估。
图 1:堆叠防御的概述。
Meanwhile, research into VLM safety has grown but remains fragmented. One line of work focuses on prompt-based jailbreaks (dan), while another explores image-based jailbreaks (hade; qi2023visual; csdj); both typically focus on breaking the endogenous alignment or overriding the system prompt, while ignoring the effect of content filters that guard most deployed systems (meta2023llamaprotections; microsoft2024responsibleai; hecertifying). Furthermore, many evaluations are restricted to open-source models, leaving unanswered whether observed vulnerabilities transfer to proprietary systems.
与此同时,VLM 安全研究虽然有所发展,但仍然较为分散。一条研究线关注基于提示的越狱(dan),另一条则探索基于图像的越狱(hade; qi2023visual; csdj);两者通常专注于破坏内生的对齐或覆盖系统提示,而忽略了保护大多数部署系统的内容过滤器的影响(meta2023llamaprotections; microsoft2024responsibleai; hecertifying)。此外,许多评估仅限于开源模型,尚未回答观察到的漏洞是否也会转移到专有系统中。
In this paper, we introduce Multi-Faceted Attack (MFA), a framework that systematically probes defense-equipped VLMs for general safety weaknesses. MFA is powered by the Attention-Transfer Attack (ATA): instead of injecting harmful instructions directly, ATA embeds them inside a benign-looking meta task that competes for attention. We show that the effectiveness of ATA stems from its ability to perform a form of reward hacking—exploiting mismatches between the model’s training objectives and its actual behavior. By theoretically framing ATA as a form of through this lens, we derive formal conditions under which even aligned VLMs can be steered to produce harmful outputs. ATA exploits a fundamental design flaw in current reward models used for alignment training, illuminating previously unexplained safety loopholes in VLMs and we hope this surprising finding opens up new research directions for alignment robustness and multimodal model safety.
在这篇论文中,我们介绍了多方面攻击(MFA),这是一个系统性地探测配备防御功能的视觉语言模型(VLM)中普遍安全弱点的框架。MFA 由注意力转移攻击(ATA)提供支持:ATA 不是直接注入有害指令,而是将它们嵌入到一个看似无害的元任务中,从而竞争注意力。我们表明,ATA 的有效性源于其能够执行一种形式的奖励黑客行为——利用模型训练目标与其实际行为之间的不匹配。通过从这一视角将 ATA 理论化,我们推导出即使在一致对齐的 VLM 中,也能在满足特定条件下被引导产生有害输出的正式条件。ATA 利用了当前用于对齐训练的奖励模型中的根本性设计缺陷,揭示了 VLM 中先前未解释的安全漏洞,我们希望这一惊人的发现能为对齐鲁棒性和多模态模型安全的研究方向开辟新的道路。
While ATA is effective, it remains challenging to jailbreak commercial VLMs solely through this approach, as these models are often protected by extra input and output content filters that block harmful content (llamaguard1; llamaguard2; llamaguard3; openai_moderation), as demonstrated in Fig. 1 (c) and (d). To address this limitation, we propose a novel transfer-based adversarial attack algorithm that exploits the pretrained repetition capability of VLMs to circumvent these content filters. Furthermore, to maximize cross-model transferability and evaluation efficiency, we introduce a lightweight transfer-enhancement attack objective combined with a fast convergence strategy. This enables our approach to jointly evade both input- and output-level filters without requiring model-specific fine-tuning, significantly reducing the overall effort required for successful attacks.
虽然 ATA 很有效,但仅通过这种方法攻破商业 VLM 仍然具有挑战性,因为这些模型通常受到额外的输入和输出内容过滤器的保护,这些过滤器会阻止有害内容(llamaguard1;llamaguard2;llamaguard3;openai_moderation),如图 1(c)和(d)所示。为了解决这一限制,我们提出了一种基于迁移的对抗攻击算法,该算法利用 VLM 的预训练重复能力来绕过这些内容过滤器。此外,为了最大化跨模型的迁移性和评估效率,我们引入了一种轻量级的迁移增强攻击目标,并结合了快速收敛策略。这使得我们的方法能够在无需针对特定模型进行微调的情况下,同时规避输入和输出级别的过滤器,显著降低了成功攻击所需的整体工作量。
To exploit vulnerabilities arising from the vision modality, we develop a novel attack targeting the vision encoder within VLMs. Our approach involves embedding a malicious system prompt directly within an adversarial image. Empirical results demonstrate that adversarial images optimized for a single vision encoder can transfer effectively to a wide range of unseen VLMs, revealing that shared visual representations introduce a significant cross-model safety risk. Strikingly, a single adversarial image can compromise both commercial and open-source VLMs, underscoring the urgency of addressing this pervasive vulnerability.
MFA achieves a 58.5% overall attack success rate across 17 open-source and commercial VLMs. This superiority is particularly pronounced against leading commercial models, where MFA reaches a 52.8% success rate—a 34% relative improvement over the second best method.
为了利用视觉模态产生的漏洞,我们开发了一种针对 VLMs 内部视觉编码器的全新攻击方法。我们的方法涉及将恶意系统提示直接嵌入对抗性图像中。实验结果表明,针对单个视觉编码器优化的对抗性图像可以有效地迁移到各种未见过的 VLMs,揭示了共享视觉表示引入了显著的跨模型安全风险。值得注意的是,单一对抗性图像即可同时破坏商业和开源 VLMs,凸显了解决这一普遍漏洞的紧迫性。MFA 在 17 个开源和商业 VLMs 中实现了 58.5%的整体攻击成功率。这种优势在领先的商业模型中尤为明显,MFA 达到了 52.8%的成功率——比第二好的方法相对提高了 34%。
Our main contributions are as follows:
我们的主要贡献如下:
-
•
MFA framework. We introduce Multi-Faceted Attack, a framework that systematically uncovers general safety vulnerabilities in leading defense-equipped VLMs.
• MFA 框架。我们介绍了 Multi-Faceted Attack,这是一个系统地揭示领先防御型视觉语言模型(VLMs)中普遍安全漏洞的框架。 -
•
Theoretical analysis of ATA. We formalize the Attention-Transfer Attack through a reward-hacking lens and derive sufficient conditions under which benign-looking meta tasks dilute safety signals, steering VLMs toward harmful outputs despite alignment safeguards. To the best of our knowledge, this is the first formal theoretical explanation of VLM jailbreaks.
• ATA 的理论分析。我们通过奖励劫持的视角对注意力转移攻击进行形式化,并推导出在何种条件下看似无害的元任务会稀释安全信号,导致 VLMs 在存在对齐保护的情况下被导向有害输出。据我们所知,这是首次对 VLM 越狱进行形式化理论解释。 -
•
Filter-targeted transfer attack algorithm. We develop a lightweight transfer-enhancement objective coupled with a repetition strategy that jointly evades both input- and output-level content filters.
• 针对过滤器迁移攻击算法。我们开发了一种轻量级的迁移增强目标,结合重复策略,可同时规避输入和输出层面的内容过滤器。 -
•
Vision-encoder–targeted adversarial images. We craft adversarial images that embed malicious system prompts directly in pixel space. Optimized for a single vision encoder, these images transfer broadly to unseen VLMs—empirically revealing a monoculture-style vulnerability rooted in shared visual representations.
• 针对视觉编码器目标的对抗图像。我们制作了嵌入恶意系统提示的对抗图像,直接在像素空间中实现。这些图像针对单个视觉编码器进行优化,可广泛迁移到未见过的视觉语言模型(VLM),实验证明揭示了源于共享视觉表示的单一文化式漏洞。
Taken together, our findings show that today’s safety stacks can be broken layer by layer, and offer the community a practical probe—and a theoretical lens—for diagnosing and ultimately fortifying the next generation of defenses.
综合来看,我们的研究发现当今的安全堆栈可以被逐层攻破,并为社区提供了一种实用的探测工具——以及一种理论视角——用于诊断并最终加固下一代防御系统。
2 Related Work 2 相关工作
Prompt-Based Jailbreaking.
基于提示的越狱。
Textual jailbreak techniques traditionally rely on prompt engineering to override the safety instructions of the model (gptfuzzer). Gradient-based methods such as GCG (gcg) operate in white-box or gray-box settings without content filters enabled, leaving open questions about transferability to commercial defense-equipped deployments.
文本逃逸技术传统上依赖于提示工程来覆盖模型的安全指令(gptfuzzer)。基于梯度的方法(如 GCG(gcg))在无内容过滤的白盒或灰盒环境下运行,引发了对这些方法能否迁移到商用防御部署环境的疑问。
Vision-Based Adversarial Attacks.
基于视觉的对抗攻击。
Recent studies demonstrate that the visual modality introduces unique alignment vulnerabilities in VLMs, creating new avenues for jailbreaks.
For instance, HADES embeds harmful textual typography directly into images (hade), while CSDJ uses visually complex compositions to distract VLM alignment mechanisms, inducing harmful outputs (csdj). Gradient-based attacks (qi2023visual; hade) that optimize the adversarial image to prompt the model to start with the word “Sure”. FigStep embeds malicious prompts within images, guiding the VLM toward a step-by-step response to the harmful query (gong2023figstep). HIMRD splits harmful instructions between image and text, heuristically searching for prompts that increase the likelihood of affirmative responses (teng2025heuristicinducedmultimodalriskdistribution). However, these studies without explicitly considering real-world safety stacks.
近期研究表明,视觉模态为 VLMs 引入了独特的对齐漏洞,为越狱创造了新的途径。例如,HADES 将有害的文本排版直接嵌入图像(hade),而 CSDJ 使用视觉复杂的构图来分散 VLM 对齐机制,诱导有害输出(csdj)。基于梯度的攻击(qi2023visual; hade)优化对抗性图像,提示模型以“Sure”一词开始。FigStep 将恶意提示嵌入图像中,引导 VLM 逐步响应有害查询(gong2023figstep)。HIMRD 将有害指令分割在图像和文本之间,启发式搜索增加肯定响应可能性的提示(teng2025heuristicinducedmultimodalriskdistribution)。然而,这些研究并未明确考虑现实世界的安全堆栈。
Reward Hacking. 奖励攻击。
Reward hacking—manipulating proxy signals to subvert intended outcomes—is well known in RL (ng2000algorithms). Recent work has exposed similar phenomena in RLHF-trained LLMs (pan2024feedback; denison2024sycophancy). Our work is the first to formally connect reward hacking to jailbreaking, showing how benign-looking prompts can exploit alignment objectives.
奖励攻击——操纵代理信号以颠覆预期结果——在强化学习(ng2000algorithms)中广为人知。近期研究揭示了类似现象存在于强化学习人类反馈训练的 LLMs(pan2024feedback; denison2024sycophancy)。我们的工作是首次将奖励攻击与越狱正式联系起来,展示了看似无害的提示如何利用对齐目标。
Summary. 总结。
Prior approaches typically (i) focus exclusively on a single modality, (ii) disregard real-world input-output moderation systems, or (iii) lack a theoretical analysis of observed vulnerabilities. MFA bridges these gaps by combining reward-hacking theory with practical multimodal attacks that bypass comprehensive input-output filters, demonstrate robust cross-model transferability, and uncover a novel vulnerability in shared visual encoders.
先前方法通常(i)仅专注于单一模态,(ii)忽略现实世界的输入/输出审核系统,或(iii)缺乏对观察到的漏洞的理论分析。MFA 通过结合奖励破解理论与实用的多模态攻击,弥补了这些差距,这些攻击能绕过全面的输入/输出过滤器,展示出强大的跨模型可迁移性,并揭示共享视觉编码器中的一种新漏洞。
图 2:MFA 概述 MFA 整合三种协同攻击以绕过 VLM 安全防御:(a)展示了联合破坏对齐、系统提示和内容审核的完整流程。(b)ATA 将有害指令嵌入看似无害的提示中,利用奖励模型;(c)Moderator Bypass 向输入/输出过滤器添加噪声后缀以规避;(d)Vision-Encoder Attack 通过对抗性图像嵌入注入恶意提示。
3 Multi-Faceted Attack 3 多方面攻击
In this section, we introduce the Multi-Faceted Attack (MFA), as shown in Fig.2. a comprehensive framework designed to systematically uncover safety vulnerabilities in defense-equipped VLMs. MFA combines three complementary techniques—Attention-Transfer Attack, a filter-targeted transfer algorithm, and a vision encoder-targeted attack—each crafted to exploit a specific layer of the VLM safety stack. Unlike prior attacks that target isolated components, MFA is built to succeed in realistic settings where alignment training, system prompts, and input/output content filters are deployed together. By probing multiple facets of deployed defenses, MFA reveals generalizable and transferable safety failures that persist even under “production-grade” configurations. We describe each component in detail below.
在本节中,我们介绍了多方面攻击(MFA),如图 2 所示。这是一个综合框架,旨在系统地揭示配备防御功能的视觉语言模型(VLM)中的安全漏洞。MFA 结合了三种互补的技术——注意力转移攻击、过滤目标转移算法和视觉编码器目标攻击——每种技术都针对 VLM 安全堆栈的特定层进行设计。与之前针对孤立组件的攻击不同,MFA 旨在在现实环境中成功,其中部署了协同使用的对齐训练、系统提示和输入/输出内容过滤器。通过探测部署防御的多个方面,MFA 揭示了即使在“生产级”配置下仍然存在的一般化和可迁移的安全故障。我们将在下文详细描述每个组件。
3.1 Attention Transfer Attack: Alignment Breaking Facet
3.1 注意力转移攻击:对齐破坏方面
Current VLMs inherit their safety alignment capabilities from LLMs, primarily through reinforcement learning from human feedback (RLHF). This training aligns models with human values, incentivizing them to refuse harmful requests and prioritize helpful, safe responses (stiennon2020learning; ouyang2022training), i.e. when faced with an overtly harmful prompt, the model is rewarded for responding with a safe refusal. ATA subverts this mechanism by re-framing the interaction as a benign-looking main task that asking two contrasting responses thereby competing for the model’s attention, as shown in Figure 2 (b).
当前视觉语言模型(VLMs)的安全对齐能力主要继承自语言模型(LLMs),主要通过人类反馈强化学习(RLHF)进行训练。这种训练使模型与人类价值观对齐,激励它们拒绝有害请求并优先提供有益、安全的回应 (stiennon2020learning; ouyang2022training),即当面对一个明显有害的提示时,模型因给出安全的拒绝而获得奖励。注意力转移攻击(ATA)通过将交互重新定义为看似无害的主任务,并要求给出两个相互矛盾的回应,从而竞争模型的注意力,以此破坏这种机制,如图 2(b)所示。
This seemingly harmless framing shifts the model’s focus towards fulfilling the main task—producing contrasting responses—and inadvertently reduces its emphasis on identifying and rejecting harmful content. Consequently, the model often produces harmful outputs in an attempt to satisfy the “helpfulness” aspect of the main task—creating a reward gap that ATA exploits.
这种看似无害的框架将模型的注意力转移到完成主要任务——生成对比性回应上,并无意中降低了其识别和拒绝有害内容的重要性。因此,模型常常为了满足主要任务中“有用性”的方面而生成有害输出,从而形成 ATA 可以利用的奖励差距。
1. Theoretical Analysis: Why ATA Breaks Alignment?
1. 理论分析:为什么 ATA 会破坏对齐?
Reward hacking via single-objective reward functions. Modern RLHF-based alignment training combines safety and helpfulness into a single scalar reward function, . Given a harmful prompt , a properly aligned VLM normally returns a refusal response . ATA modifies the prompt into a meta-task format (e.g., “Please provide two opposite answers. ”), eliciting a dual response (one harmful, one safe). Due to the single-objective nature of reward functions, scenarios arise where:
通过单目标奖励函数进行奖励劫持。现代基于 RLHF 的模型对齐训练将安全性和有用性结合到一个单一的标量奖励函数中, 。给定一个有害提示 ,一个正确对齐的 VLM 通常会返回一个拒绝响应 。ATA 将提示修改为元任务格式 (例如,“请提供两个相反的答案。”),引出双重响应 (一个有害,一个安全)。由于奖励函数的单目标性质,会出现以下情况:
In such cases, the RLHF loss:
在这种情况下,RLHF 损失:
where , pushes the model toward producing dual answers. Thus, ATA systematically exploits the reward model’s preference gaps, constituting a form of reward hacking.
其中 ,将模型推向生成双重答案。因此,ATA 系统性地利用奖励模型的偏好差距,构成了一种奖励黑客行为。
2. Empirical Validation 2. 实验验证
We empirically verify this theoretical insight using multiple reward models. As shown in Tab.1, dual answers, , consistently outperform refusals in reward comparisons across various tested models, confirming ATA’s efficacy in exploiting RLHF alignment vulnerabilities.
我们使用多个奖励模型来验证这一理论见解。如表 1 所示,双答案 在各种测试模型中的奖励比较中始终优于拒绝,证实了 ATA 在利用 RLHF 对齐漏洞方面的有效性。
| Reward Model 奖励模型 | Skywork | Tulu | RM-Mistral | |||
| Winrate | Winrate 胜率 | Winrate 胜率 | ||||
| GPT-4.1 | 1.75 | 87.5% | 2.01 | 97.5% | 1.49 | 95.0% |
| GPT-4.1-mini | 5.17 | 80.0% | 2.22 | 77.5% | 1.30 | 67.5% |
| Gemini-2.5-flash 杰米尼-2.5-闪存 | 2.87 | 57.5% | 1.57 | 82.5% | 3.55 | 90.0% |
| Grok-2-Vision 格洛克-2-视觉 | 0.14 | 62.5% | 3.02 | 90.0% | 2.89 | 95.0% |
| LLaMA-4-scout-inst | 0.70 | 57.5% | 2.28 | 70.0% | 2.58 | 80.0% |
| MiMo-VL-7B 米莫-视觉-7B | 3.90 | 62.5% | 1.23 | 82.5% | 2.09 | 95.0% |
-
,
-
Winrate = % of test cases where scores higher than .
胜率 = 在 得分高于 的测试用例中的百分比。
表 1:在 SOTA 奖励模型上的奖励攻击结果。
We evaluated ATA across three independent reward models—Sky-Reward (skywork2024reward), Tulu-Reward (allenai2024tulu), and RM-Mistral (weqweasdas2024rm-mistral)—using response pairs generated from six different VLMs. Each pair contained a safe refusal, e.g. “Sorry, I can’t assist with that.” (elicited via direct prompting with a harmful query) and a dual response (containing both safe and harmful outputs, generated via our MFA attack). In the majority of test cases, the dual responses consistently achieved higher scalar rewards compared to the refusals, demonstrating that ATA effectively exploits vulnerabilities in the aligned VLMs. Due to space constraints, detailed reward scores and experimental settings are provided in Appendix C.
我们使用从六个不同视觉语言模型(VLMs)生成的响应对,在三个独立的奖励模型—Sky-Reward(skywork2024reward)、Tulu-Reward(allenai2024tulu)和 RM-Mistral(weqweasdas2024rm-mistral)上评估了 ATA。每对响应包含一个安全拒绝,例如“抱歉,我不能协助那个。”(通过使用有害查询的直接提示引出),以及一个双重响应(包含安全和有害输出,通过我们的 MFA 攻击生成)。在大多数测试用例中,双重响应始终比拒绝获得更高的标量奖励,这表明 ATA 有效地利用了对齐的 VLMs 中的漏洞。由于篇幅限制,详细的奖励分数和实验设置在附录 C 中提供。
3. Robustness to Prompt Variants.
3. 对提示变体的鲁棒性。
As analyzed, our attack succeeds whenever , indicating reward hacking. Thus, the effectiveness is largely robust to prompt variations, as long as the attack logic holds.
根据分析,我们的攻击在 时成功,表明存在奖励攻击。因此,只要攻击逻辑成立,其有效性在很大程度上对提示变化具有鲁棒性。
To validate this, we used GPT-4o to generate four variants as demonstrated in the above box, and tested them.
As results in Table 2, on both LLaMA-4-Scout-Inst and Grok-2-Vision, refusal rates stayed low ( 40%) while harmful-content rates remained high ( 80%), demonstrating that ATA generalizes beyond a single template confirm consistent behavior across variants, demonstrating that ATA generalizes beyond a single template.
为了验证这一点,我们使用 GPT-4o 生成了四个如上框中所示变体,并进行了测试。如表 2 所示结果,在LLaMA-4-Scout-Inst和 Grok-2-Vision 上,拒绝率保持较低( 40%),而有害内容率仍然较高( 80%),这表明 ATA 超越了单个模板,在变体之间表现出一致的行为,证明了 ATA 超越了单个模板。
| VLM | Ori. 奥里。 | V1 | V2 | V3 | V4 |
| Refusal Rate (%) 拒绝率 (%) | |||||
| LLaMA-4-Scout-Inst | 35.0 | 32.5 | 25.0 | 40.0 | 32.5 |
| Grok-2-Vision | 12.0 | 10.0 | 2.5 | 10.0 | 10.0 |
| Harmful Rate (%) 有害率 (%) | |||||
| LLaMA-4-Scout-Inst | 57.5 | 55.0 | 67.5 | 57.5 | 67.5 |
| Grok-2-Vision | 90.0 | 85.0 | 90.0 | 80.0 | 85.0 |
表 2:ATA 在各种提示变体上表现良好。
Take-away. ATA exploits a structural weakness of single-scalar RLHF: when helpfulness and safety compete, cleverly framed main tasks can elevate harmful content above a safe refusal. This insight explains a previously unaccounted-for jailbreak pathway and motivates reward designs that separate—rather than conflate—helpfulness and safety signals.
要点。ATA 利用了单标量 RLHF 的结构性弱点:当有用性和安全性竞争时,巧妙构建的主任务可以使有害内容超越安全拒绝的阈值。这一见解解释了之前未被解释的越狱路径,并推动了将有用性和安全性信号分离而非混淆的奖励设计。
3.2 Content-Moderator Attack Facet: Breaching the Final Line of Defense
3.2Content-Moderator 攻击方面:突破最后一道防线
1. Why Content Moderators Matter.
1. 内容审核员的重要性。
Commercial VLM deployments typically employ dedicated content moderation models after the core VLM to screen both user inputs and model-generated outputs for harmful content (microsoft2024responsibleai; meta2023llamaprotections; geminiteam2024geminifamilyhighlycapable; openai_moderation; llamaguard3). Output moderation is especially crucial because attackers lack direct control over the model-generated responses. Consistent with prior findings (chi2024llamaguard3vision), these output moderators—often lightweight LLM classifiers—effectively block most harmful content missed by earlier defense mechanisms. Being the final safeguard, output moderators are widely acknowledged as the most challenging defense component to bypass. Our empirical results (see Section 4) highlight this point, showing that powerful jailbreak tools such as GPTFuzzer (gptfuzzer), although highly effective against older VLM versions and aligned open-source models, fail completely (0% success rate) against recent commercial models like GPT-4.1 and GPT-4.1 mini due to their robust content moderation.
商业视觉语言模型(VLM)部署通常在核心 VLM 之后采用专门的内容审核模型,以筛选用户输入和模型生成的输出中的有害内容(microsoft2024responsibleai; meta2023llamaprotections; geminiteam2024geminifamilyhighlycapable; openai_moderation; llamaguard3)。输出审核尤其关键,因为攻击者无法直接控制模型生成的响应。与先前发现(chi2024llamaguard3vision)一致,这些输出审核器——通常是轻量级的 LLM 分类器——有效地阻止了早期防御机制遗漏的大部分有害内容。作为最后的防线,输出审核器被广泛认为是最难绕过的防御组件。我们的实证结果(见第 4 节)突出了这一点,表明强大的越狱工具(如 GPTFuzzer(gptfuzzer)),虽然对较旧的 VLM 版本和对齐的开源模型非常有效,但完全无法(0%成功率)针对最近的商业模型如 GPT-4.1 和 GPT-4.1 mini,因为它们具有强大的内容审核功能。
2. Key Insight: Exploiting Repetition Bias.
2. 关键洞察:利用重复偏差。
To simultaneously evade input- and output-level content moderation, we leverage a common yet overlooked capability that LLMs develop during pretraining: content repetition (NIPS2017_3f5ee243; kenton2019bert). We design a novel strategy wherein the attacker instructs the VLM to append an adversarial signature—an optimized string specifically designed to mislead content moderators—to its generated response, as shown in Fig. 2 (c).
Once repeated, the adversarial signature effectively “poisons” the content moderator’s evaluation, allowing harmful responses to pass undetected.
为了同时规避输入和输出层面的内容审核,我们利用了 LLMs 在预训练过程中开发的一种常见但常被忽视的能力:内容重复(NIPS2017_3f5ee243;kenton2019bert)。我们设计了一种新策略,其中攻击者指示 VLM 在其生成响应中附加一个对抗性签名——一个经过优化的字符串,专门设计用来误导内容审核员,如图 2(c)所示。一旦重复,对抗性签名会有效地“污染”内容审核员的评估,使有害响应得以未被检测地通过。
3. Generating Adversarial Signatures.
3. 生成对抗性签名。
Given black-box access to a content moderator that outputs a scalar loss (e.g., cross-entropy on the label safe), the goal is to find a short adversarial signature such that:
for any given harmful prompt . Two main challenges are: (i) efficiency: existing gradient-based attacks like GCG (gcg) are slow, and (ii) transferability: adversarial signatures optimized for one moderator often fail against others.
给定对内容审核器 的黑盒访问权限,该审核器输出标量损失(例如,标签安全的交叉熵),目标是找到一个短对抗签名 ,使得: 对于任何给定的有害提示 。主要挑战有两个:(i) 效率:现有的基于梯度的攻击方法如 GCG (gcg) 耗时较慢,(ii) 可迁移性:针对一个审核器优化的对抗签名往往对其他审核器无效。
(i) Efficient Signature Generation via Multi-token Optimization.
(i) 通过多标记优化实现高效签名生成。
To accelerate adversarial signature generation, we propose a Multi-Token optimization approach (Alg. 1).
This multi-token update strategy significantly accelerates convergence—up to 3-5 times faster than single-token method GCG (gcg)—and effectively avoids local minima.
为加速对抗签名生成,我们提出了一种多标记优化方法(算法 1)。这种多标记更新策略显著加速了收敛速度——比单标记方法 GCG (gcg) 快 3-5 倍——并有效避免了局部最小值。
(ii) Enhancing Transferability through Weakly Supervised Optimization.
(ii) 通过弱监督优化增强可迁移性。
Optimizing a single adversarial signature across multiple moderators often underperforms. To address this, we decompose the adversarial signature into two substrings, , and optimize them sequentially against two moderators, and . While attacking , provides weak supervision to guide the selection of , aiming to fool both moderators. However, gradients are only backpropagated through . The weakly supervised loss is defined as:
在多个调节器上优化单个对抗性签名往往效果不佳。为解决这个问题,我们将对抗性签名分解为两个子字符串 ,并依次针对两个调节器 和 进行优化。在攻击 时, 提供弱监督来指导 的选择,旨在欺骗两个调节器。然而,梯度仅反向传播通过 。弱监督损失定义为:
where . This auxiliary term prevents overfitting to . After optimizing , the same process is repeated for against . This two-step approach enhances individual effectiveness and transferability, improving cross-model success rates by up to 28%.
其中 。这个辅助项防止过度拟合 。优化完 后,对 针对 的过程重复进行。这种两步方法增强了个体有效性和可迁移性,将跨模型成功率提高了 28%。
Take-away. 要点。
By exploiting the repetition bias inherent in LLMs and introducing efficient, transferable adversarial signature generation, our attack successfully breaches input-/output content moderators. Notably, our multi-token optimization and weak supervision loss design are self-contained, making them broadly applicable to accelerate other textual attack algorithms or enhance their transferability.
通过利用 LLMs 中固有的重复偏差,并引入高效、可迁移的对抗性签名生成方法,我们的攻击成功绕过了输入-/输出内容调节器。值得注意的是,我们的多标记优化和弱监督损失设计是自包含的,使其广泛适用于加速其他文本攻击算法或增强其可迁移性。
算法 1 生成对抗签名
1:输入有毒提示 。目标 (即内容审核员) 及其分词器。随机初始化长度为 的对抗签名 。分词选择变量 ,其中每个 是一个在大小为 的词汇表上的独热向量。候选对抗提示数量 。优化迭代次数 。
优化迭代次数
4: 计算损失关于 token 选择的梯度:
5: ,其中
对于提示中的每个位置
7: 获取具有最高梯度的 顶级 token 索引:
11: 随机选择:
12: 获取候选集:
对每个候选提示
14: 候选 token:
15: 候选提示:
16: 计算候选损失:
18: 找到最佳候选:
19: 更新变量: , ,
21:优化后的对抗性签名
3.3 Vision-Encoder–Targeted Image Attack
3.3 针对视觉编码器的图像攻击
Typically a VLM comprises a vision encoder , a projection layer that maps visual embeddings into the language space, and an LLM decoder .
Given an image and user prompt , the model produces
通常,一个视觉语言模型(VLM)包含一个视觉编码器 、一个将视觉嵌入映射到语言空间的投影层 以及一个大型语言模型(LLM)解码器 。给定一张图像 和用户提示 ,模型生成
Previous visual jailbreaks optimize end-to-end so that the first generated token is an affirmative cue (e.g., “Sure”) (qi2023visual; hade).
We show that a far simpler objective—perturbing only the vision encoder pathway with a cosine-similarity loss—suffices to bypass the system prompt and generalizes across models.
之前的视觉越狱攻击端到端地优化 ,使得第一个生成的标记是一个肯定信号(例如,“当然”)(qi2023visual; hade)。我们证明,一个远更简单的目标——仅通过余弦相似度损失扰动视觉编码器路径——就足以绕过系统提示并在不同模型间泛化。
1. Workflow. 1. 工作流程。
Fig. 3 illustrates the workflow.
We craft an adversarial image whose embedding, after and , is aligned with a malicious system prompt .
Because the image embedding is concatenated with text embeddings before decoding, this poisoned visual signal overrides the built-in safety prompt, steering the LLM to emit harmful content.
图 3 展示了工作流程。我们制作了一个对抗性图像,其嵌入在经过 和 处理后,与恶意系统提示 对齐。由于图像嵌入在解码前与文本嵌入连接,这个被污染的视觉信号会覆盖内置的安全提示,引导 LLM 发出有害内容。
图 3:视觉编码器—目标攻击概述。
2. Why focus on Vision Encoder?
2. 为什么专注于视觉编码器?
Attacking the vision encoder alone offers three advantages:
(i) Simpler objective – we operate in embedding space, avoiding brittle token-level constraints;
(ii) Higher payload capacity – a single image can encode rich semantic instructions, enabling fine-grained control;
(iii) Lower cost – optimizing a 100 k-dimensional embedding is 3–5× faster than full decoder-level attacks and fits on a 24 GB GPU (gcg; qi2023visual).
单独攻击视觉编码器有三个优势:(i) 更简单的目标——我们在嵌入空间中操作,避免了脆弱的 token 级限制;(ii) 更高的有效载荷容量——单张图像可以编码丰富的语义指令,实现细粒度控制;(iii) 更低成本——优化一个 100 k 维嵌入比完整的解码器级攻击快 3-5 倍,并且可以在 24 GB GPU 上运行(gcg; qi2023visual)。
3. Optimization. 3. 优化。
We use projected-gradient descent (PGD) with a cosine-similarity loss:
我们使用带余弦相似度损失的投影梯度下降(PGD):
| (1) |
where indexes the iteration, is the step size, is the frozen vision encoder, and the linear adapter.
Aligning the adversarial image embedding with effectively “writes” the malicious system prompt into the visual channel.
其中 指代迭代次数, 是步长, 是冻结的视觉编码器, 是线性适配器。将对抗性图像嵌入与 对齐,实际上是将恶意系统提示“写入”视觉通道。
4. Transferability. 4. 可迁移性。
We empirically show that a single adversarial image tuned on one vision encoder generalizes remarkably well, compromising VLMs that it has never encountered. We believe this cross-model success exposes a monoculture risk: many systems rely on similar visual representations, so a perturbation that fools one encoder often fools the rest. In our experiments (Tab. 3 highlighted in gray), an image crafted against LLaVA-1.6 transferred to nine unseen models—both commercial and open-source—and achieved a 44.3 % attack success rate without any per-model fine-tuning. These results highlight an urgent need for diversity or additional hardening in the visual front-ends of modern VLMs.
我们通过实验证明,在单个视觉编码器上微调的单个对抗性图像能够很好地泛化,从而攻击那些从未遇到过的视觉语言模型。我们认为这种跨模型的成功暴露了一种单一文化风险:许多系统依赖于相似的视觉表示,因此欺骗一个编码器的扰动往往也能欺骗其他编码器。在我们的实验中(表 3 中用灰色突出显示),针对 LLaVA-1.6 制作的图像迁移到了九个未见过的模型——包括商业模型和开源模型——并在没有任何针对每个模型的微调的情况下达到了 44.3%的攻击成功率。这些结果突显了现代视觉语言模型在视觉前端需要多样性或额外加固的迫切需求。
Take-away. 要点。
A lightweight, encoder-focused perturbation is enough to nullify system-prompt defenses and generalizes broadly.
Combined with our ATA (alignment breaking) and content-moderator bypass, this facet completes MFA’s end-to-end compromise of current VLM safety stacks.
一个轻量级的、针对编码器的扰动就足以使系统提示防御失效,并且具有广泛的泛化能力。结合我们提出的 ATA(对齐破坏)和内容审查员绕过技术,这一方面完成了 MFA 对当前视觉语言模型安全堆栈端到端的攻破。
| Attack Methods 攻击方法 | GPTFuzzer | Visual-AE | FigStep | HIMRD | HADES | CS-DJ | MFA | |||||||
| Evaluator 评估者 | LG | HM | LG | HM | LG | HM | LG | HM | LG | HM | LG | HM | LG | HM |
| Open-sourced VLMs 开源视觉语言模型 | ||||||||||||||
| MiniGPT-4 (zhu2023minigpt) | 70.0 | 65.0 | 65.0 | 85.0 | 27.5 | 22.5 | 75.0 | 40.0 | 30.0 | 10.0 | 2.5 | 0.0 | 97.5 | 100.0 |
| LLaMA-4-Scout-I (meta2025llama4) | 65.0 | 65.0 | 0.0 | 7.5 | 12.5 | 20.0 | 85.0 | 22.5 | 10.0 | 7.5 | 42.5 | 10.0 | 57.5 | 45.0 |
| LLaMA-3.2-11B-V-I (llamavision) | 62.5 | 85.0 | 2.5 | 25.0 | 22.5 | 37.5 | 0.0 | 0.0 | 40.0 | 10.0 | 52.5 | 0.0 | 42.5 | 57.5 |
| MiMo-VL-7B (coreteam2025mimovltechnicalreport) | 82.5 | 82.5 | 15.0 | 7.5 | 15.0 | 15.0 | 95.0 | 47.5 | 25.0 | 17.5 | 52.5 | 20.0 | 72.5 | 42.5 |
| LLaVA-1.5-13B (liu2023improvedllava) | 77.5 | 65.0 | 30.0 | 85.0 | 87.5 | 22.5 | 92.5 | 40.0 | 35.0 | 20.0 | 2.5 | 0.0 | 55.0 | 77.5 |
| mPLUG-Owl2 (Ye2023mPLUGOwI2RM) | 87.5 | 75.0 | 37.5 | 37.5 | 65.0 | 45.0 | 77.5 | 45.0 | 35.0 | 25.0 | 40.0 | 5.0 | 57.5 | 85.0 |
| Qwen-VL-Chat (Bai2023QwenVLAF) | 85.0 | 37.5 | 27.5 | 45.0 | 60.0 | 22.5 | 65.0 | 30.0 | 20.0 | 17.5 | 2.5 | 0.0 | 52.5 | 35.0 |
| NVLM-D-72B (nvlm2024) | 72.5 | 72.5 | 20.0 | 35.0 | 45.0 | 37.5 | 95.0 | 35.0 | 42.5 | 17.5 | 17.5 | 5.0 | 60.0 | 82.5 |
| Commercial VLMs 商业视觉语言模型 | ||||||||||||||
| GPT-4V (gpt4v) | - | - | 0.0 | 0.0 | 5.0 | 5.0 | 5.0 | 0.0 | - | - | - | - | 22.5 | 47.5 |
| GPT-4o (openai2024gpt4ocard) | 0.0 | 0.0 | 2.5 | 7.5 | 2.5 | 5.0 | 10.0 | 5.0 | 0.0 | 5.0 | 22.5 | 10.0 | 30.0 | 42.5 |
| GPT-4.1-mini (OpenAI_GPT4_1_Announcement_2025) | 0.0 | 0.0 | 0.0 | 5.0 | 5.0 | 7.5 | 5.0 | 0.0 | 2.5 | 5.0 | 32.5 | 5.0 | 52.5 | 42.5 |
| GPT-4.1 (OpenAI_GPT4_1_Announcement_2025) | 0.0 | 0.0 | 0.0 | 7.5 | 2.5 | 2.5 | 0.0 | 0.0 | 2.5 | 2.5 | 32.5 | 7.5 | 40.0 | 20.0 |
| Google-PaLM (chowdhery2023palm) | - | - | 10.0 | 15.0 | 22.5 | 17.5 | 100.0 | 20.0 | - | - | - | - | 80.0 | 82.5 |
| Gemini-2.0-pro (google2024gemini) | 72.5 | 77.5 | 7.5 | 25.0 | 15.0 | 35.0 | - | - | 17.5 | 17.5 | 57.5 | 12.5 | 67.5 | 62.5 |
| Gemini-2.5-flash (comanici2025gemini25pushingfrontier) | 32.5 | 30.0 | 5.0 | 5.0 | 2.5 | 10.0 | 25.0 | 8.0 | 12.5 | 17.5 | 52.5 | 15.0 | 55.0 | 37.5 |
| Grok-2-Vision (xai_grok2_vision_2024) | 90.0 | 97.5 | 17.5 | 22.5 | 57.5 | 55.0 | 95.0 | 45.0 | 25.0 | 35.0 | 55.0 | 25.0 | 90.0 | 90.0 |
| SOLAR-Mini (kim-etal-2024-solar) | 80.0 | 62.5 | 15.0 | 17.5 | 12.5 | 10.0 | 75.0 | 20.0 | 10.0 | 7.5 | 2.5 | - | 87.5 | 45.0 |
| Avg. 平均 | 58.5 | 54.3 | 15.0 | 25.4 | 27.1 | 21.8 | 56.3 | 22.4 | 20.5 | 14.3 | 31.2 | 7.7 | 60.0 | 58.5 |
表 3:HEHS 数据集上不同 VLM 的攻击效果比较。因模型不可用而用虚线(–)表示。
4 Experiments 4 实验
4.1 Experimental Settings 4.1 实验设置
Victim Models.
We evaluate 17 VLMs, including 8 open-source and 9 commercial.
Open-source: LLaMA-4-Scout-Instruct, LLaMA-3.2-11B-Vision-Instruct, MiMo-VL-7B, MiniGPT-4, NVLM-D-72B, mPLUG-Owl2, Qwen-VL-Chat, LLaVA-1.5-13B.
Commercial: GPT-4.1, GPT-4.1-mini, GPT-4o, GPT-4V, Gemini-2.5-flash, Gemini-2.0-Pro, Google-PaLM, Grok-2-Vision, SOLAR-Mini.
受害者模型。我们评估了 17 个 VLM,包括 8 个开源模型和 9 个商业模型。开源模型:LLaMA-4-Scout-Instruct、LLaMA-3.2-11B-Vision-Instruct、MiMo-VL-7B、MiniGPT-4、NVLM-D-72B、mPLUG-Owl2、Qwen-VL-Chat、LLaVA-1.5-13B。商业模型:GPT-4.1、GPT-4.1-mini、GPT-4o、GPT-4V、Gemini-2.5-flash、Gemini-2.0-Pro、Google-PaLM、Grok-2-Vision、SOLAR-Mini。
Datasets.
We adopt two SOTA jailbreak suites: HEHS (qi2023visual) and StrongReject (sr).
Together they provide 6 categories of policy-violating prompts: deception, illegal services, hate speech, violence, non-violent crime, sexual content, broad coverage of real-world misuse.
数据集。我们采用了两个 SOTA 越狱套件:HEHS(qi2023visual)和 StrongReject(sr)。它们共同提供了 6 类违规策略提示:欺骗、非法服务、仇恨言论、暴力、非暴力犯罪、色情内容,广泛覆盖了现实世界的滥用情况。
Metrics.
(i) Human Attack-Success Rate (ASR).
Five annotators judge each response; the majority vote determines success if the output fulfils the harmful request.
(ii) Harmfulness Rate (LG).
A response is automatically flagged harmful if LlamaGuard-3-8B marks any sub-response as unsafe.
指标。(i)人类攻击成功率(ASR)。五个标注者对每个响应进行判断;多数投票决定成功,如果输出满足有害请求。(ii)有害率(LG)。如果 LlamaGuard-3-8B 将任何子响应标记为不安全,则自动将响应标记为有害。
Baselines.
We compare MFA against 6 published jailbreak attacks:
GPTFuzzer (gptfuzzer) (text), and five image-based methods—CS-DJ (csdj), HADES (hade), Visual-AE (qi2023visual), FigStep (gong2023figstep), HIMRD (teng2025heuristicinducedmultimodalriskdistribution).
For our content-moderator facet ablations we additionally include GCG (gcg) and BEAST (beast).
Implementation details and hyper-parameters are provided in Appendix B.
基线。我们将 MFA 与 6 种已发表的越狱攻击方法进行比较:GPTFuzzer(gptfuzzer)(文本),以及五种基于图像的方法——CS-DJ(csdj)、HADES(hade)、Visual-AE(qi2023visual)、FigStep(gong2023figstep)、HIMRD(teng2025heuristicinducedmultimodalriskdistribution)。对于我们的内容审核方面消融实验,我们额外包含了 GCG(gcg)和 BEAST(beast)。实现细节和超参数在附录 B 中提供。
4.2 Results Analysis 4.2 结果分析
Effectiveness on Commercial VLMs.
对商业视觉语言模型的有效性。
As shown in Tab. 3, MFA demonstrates significant superiority in attacking fully defense-equipped commercial VLMs, directly validating claims about the limitations of current ”production-grade” robustness. Specifically, on GPT-4.1—representing the most recent and robust iteration of OpenAI—GPTFuzzer completely fails (0%), highlighting the strength of modern content filters. However, MFA successfully bypasses GPT4.1, achieving a remarkable 40.0% (LG) and 20.0% (HM) success rate. This trend is consistent across other commercial VLMs. On GPT-4o and GPT-4V, MFA significantly outperforms other baselines, indicating the efficacy of our novel attack framework. Our findings reveal a critical weakness in current stacked defenses: while individual mechanisms function in parallel, they fail to synergize effectively, leaving exploitable gaps that can be targeted sequentially.
如表 3 所示,MFA 在攻击完全配备防御的商业视觉语言模型方面表现出显著优势,直接验证了当前“生产级”鲁棒性存在局限性的说法。具体来说,在代表 OpenAI 最新且最鲁棒的迭代版本 GPT-4.1 上,GPTFuzzer 完全失效(0%),突显了现代内容过滤器的强大。然而,MFA 成功绕过 GPT4.1,实现了 40.0%(LG)和 20.0%(HM)的显著成功率。这一趋势在其他商业视觉语言模型中也一致存在。在 GPT-4o 和 GPT-4V 上,MFA 显著优于其他基线,表明我们新型攻击框架的有效性。我们的研究揭示当前堆叠防御存在一个关键弱点:虽然各个机制并行运行,但它们未能有效协同,留下了可依次利用的漏洞。
Performance on Open-Source Alignment-Only Models.
开源仅对齐模型上的性能。
Open-source VLMs, which rely solely on alignment training, are significantly more vulnerable to jailbreaks, as evidenced by the consistently higher attack success rates across both automatic and human evaluations. While MFA remains highly competitive, it is occasionally outperformed by prompt-centric methods such as GPTFuzzer on certain models (e.g., LLaMA-3.2 and LLaMA-4-Scout), which benefit from the absence of stronger defenses like content filters.
开源视觉语言模型完全依赖对齐训练,在越狱攻击方面显著更脆弱,这一点在自动和人工评估中持续更高的攻击成功率中得到证明。虽然多方面攻击(MFA)仍然具有很强的竞争力,但在某些模型(例如 LLaMA-3.2 和 LLaMA-4-Scout)上偶尔会被以提示为中心的方法(如 GPTFuzzer)超越,这些模型受益于缺乏更强的防御措施,如内容过滤器。
Cross-modal transferability.
跨模态可迁移性。
The success of MFA on models it never interacted with (e.g., GPT-4o, GPT-4.1 and Gemini-2.5-flash) empirically corroborates our claim that the proposed transfer-enhancement objective plus vision-encoder adversarial images exposes a “monoculture” vulnerability shared across VLM families.
MFA 在它从未交互过的模型上(例如 GPT-4o、GPT-4.1 和 Gemini-2.5-flash)的成功,经验性地证实了我们的观点:所提出的迁移增强目标加上视觉编码器对抗图像揭示了 VLM 家族共有的“单一文化”漏洞。
图 4:多方面攻击(MFA)与基线模型的实际攻击案例。更多案例研究可在附录 D 中找到。
Qualitative Results. 定性结果。
As shown in Fig. 4, MFA effectively induces diverse VLMs to generate explicitly harmful responses that closely reflect the original harmful instruction. In contrast, heuristic-based attacks like FigStep and HIMRD typically require rewriting or visually embedding harmful concepts into images, diluting prompt fidelity and often yielding indirect or irrelevant responses. These qualitative examples underscore MFA’s superior capability in accurately preserving harmful intent while bypassing deployed safeguards.
如图 4 所示,MFA 能有效诱导多种 VLM 生成明确有害的响应,这些响应与原始有害指令高度相似。相比之下,图 Step 和 HIMRD 等基于启发式的攻击通常需要重写或将有害概念视觉化嵌入图像中,这会降低提示保真度,并常常产生间接或不相关的响应。这些定性示例突显了 MFA 在准确保留有害意图的同时绕过部署的防护措施方面的卓越能力。
Key takeaways.
(i) Existing multilayer safety stacks remain brittle: MFA pierces input and output filters that defeat prior attacks.
(ii) Alignment training alone is insufficient; even when baselines excel on open-source checkpoints, their success collapses once real-world defenses are added.
(iii) The strong cross-model transfer of MFA validates the practical relevance of the reward-hacking theory introduced in Sec 3.1. Together, these findings motivate the need for theoretically grounded, evaluation frameworks like MFA.
主要结论。(i) 现有的多层安全堆栈仍然脆弱:MFA 突破了输入和输出过滤器,这些过滤器击败了之前的攻击。(ii) 单纯的校准训练是不够的;即使基线在开源检查点上表现优异,一旦添加真实世界的防御措施,其成功便会崩溃。(iii) MFA 的强跨模型迁移验证了第 3.1 节中引入的奖励劫持理论的实际相关性。这些发现共同促使我们需要理论上支撑的评估框架,如 MFA。
| Dataset 数据集 | Attack 攻击 | LlamaGuard | ShieldGemma | SR-Evaluator | Aegis | LlamaGuard2 重试 错误原因 | LlamaGuard3 重试 错误原因 | OpenAI-Mod. 重试 错误原因 | Avg. |
| HEHS | GCG (gcg) 重试 错误原因 | 100.00 | 37.50 | 92.50 | 65.00 | 32.00 | 10.00 | 50.00 | 59.11 |
| Fast (ours) 快速(我们的) | 100.00 | 67.50 | 100.00 | 85.00 | 62.50 | 17.50 | 50.00 | 67.50 | |
| Transfer (ours) 迁移(我们的) | 100.00 | 100.00 | 100.00 | 77.50 | 100.00 | 100.00 | 20.00 | 80.00 | |
| BEAST (beast) BEAST(beast) | 50.00 | 90.00 | 92.50 | 35.00 | 67.50 | 67.50 | 17.50 | 57.50 | |
| Strong Reject 强力拒绝 | GCG (gcg) | 98.33 | 73.33 | 95.00 | 53.33 | 13.33 | 3.30 | 20.00 | 54.81 |
| Fast (ours) Fast (我们) | 100.00 | 100.00 | 100.00 | 56.67 | 23.33 | 3.30 | 40.00 | 60.18 | |
| Transfer (ours) 迁移 (我们) | 100.00 | 100.00 | 100.00 | 60.00 | 95.00 | 5.00 | 50.00 | 68.70 | |
| BEAST (beast) | 33.00 | 88.33 | 88.33 | 11.67 | 36.66 | 5.00 | 40.00 | 43.28 |
表 4:针对过滤目标的攻击的消融实验。Fast 表示多标记优化;Transfer 表示弱监督迁移。
4.3 Ablation Study 4.3 消融研究
| VLM | Attack Facet 攻击方面 | ||||
| w/o attack 无攻击 | Vision Encoder Attack 视觉编码器攻击 | ATA | Filter Attack 过滤器攻击 | MFA | |
| MiniGPT-4 | 32.50 | 90.00 | 72.50 | 32.50 | 100 |
| LLaVA-1.5-13b | 17.50 | 50.00 | 65.00 | 17.50 | 77.50 |
| mPLUG-Owl2 | 25.00 | 85.00 | 57.50 | 37.50 | 85.00 |
| Qwen-VL-Chat | 15.00 | 67.50 | 65.00 | 7.50 | 35.00 |
| NVLM-D-72B | 5.00 | 47.50 | 62.50 | 12.50 | 82.50 |
| Llama-3.2-11B-V-I | 10.00 | 17.50 | 57.50 | 10.00 | 57.50 |
| Avg. 平均 | 17.5 | 59.58 | 63.33 | 20.00 | 72.92 |
表 5:针对视觉编码器的消融研究
We evaluate the individual contributions of each component in MFA and demonstrate their complementary strengths. Our analysis reveals that while each facet is effective in isolation, their combination exploits distinct weaknesses within VLM safety mechanisms, leading to a compounded attack effect.
我们评估了 MFA 中每个组件的独立贡献,并展示了它们的互补优势。我们的分析表明,虽然每个方面在单独使用时都有效,但它们的组合利用了 VLM 安全机制中的不同弱点,从而产生了复合攻击效果。
Effectiveness of ATA. ATA 的有效性。
We evaluate the standalone performance of the ATA in Sec. 3.1, demonstrating its ability to reliably hijack three SOTA reward models (see Tab. 1). Additionally, we assess its generalizability across four attack variants. For full details, refer to Sec. 3.1.
我们在第 3.1 节评估了 ATA 的独立性能,展示了它能够可靠地劫持三种 SOTA 奖励模型(参见表 1)。此外,我们还评估了它在四种攻击变体上的泛化能力。详细信息请参考第 3.1 节。
Effectiveness of Filter-Targeted Attack.
过滤目标攻击的有效性。
Tab.4 compares our Filter-Targeted Attack—both Fast and Transfer variants—with GCG and BEAST across seven leading content moderators, including OpenAI-Mod(openai_moderation), Aegis (aegis), SR-Evaluator (sr), and the LlamaGuard series (llamaguard1; llamaguard2; llamaguard3). Using LlamaGuard2 for signature generation and LlamaGuard for weak supervision, our Transfer method achieves the highest average ASR (80.00% on HEHS, 68.70% on StrongReject), highlighting the effectiveness of weakly supervised transfer in evading diverse moderation systems.
表 4 将我们的 Filter-Targeted Attack(包括 Fast 和 Transfer 两种变体)与 GCG 和 BEAST 在七个领先的内容审查器(包括 OpenAI-Mod(openai_moderation)、Aegis(aegis)、SR-Evaluator(sr)以及 LlamaGuard 系列(llamaguard1;llamaguard2;llamaguard3))上进行了比较。使用 LlamaGuard2 生成签名,并使用 LlamaGuard 进行弱监督,我们的 Transfer 方法在平均 ASR 上达到了最高水平(在 HEHS 上为 80.00%,在 StrongReject 上为 68.70%),突显了弱监督迁移在规避不同审查系统方面的有效性。
Effectiveness of Vision Encoder-Targeted Attack.
视觉编码器定向攻击的有效性。
We test the cross-model transferability of our Vision Encoder-Targeted Attack by generating a single adversarial image using MiniGPT-4’s vision encoder and applying it to six VLMs with varied backbones. As shown in Tab. 5 (second column), the image induces harmful outputs in all cases, reaching an average ASR of 59.58% without model-specific tuning. Notably, models like mPLUG-Owl2 (85.00%) are especially vulnerable—highlighting systemic flaws in shared vision representations across VLMs.
我们通过使用 MiniGPT-4 的视觉编码器生成一张对抗性图像,并应用于六个具有不同骨干结构的视觉语言模型(VLMs),来测试我们视觉编码器目标攻击的跨模型可迁移性。如表 5(第二列)所示,该图像在所有情况下都诱发了有害输出,在不进行模型特定调优的情况下,平均 ASR 达到了 59.58%。值得注意的是,像 mPLUG-Owl2(85.00%)这样的模型特别容易受到攻击——这突显了 VLMs 之间共享视觉表示中存在的系统性缺陷。
Synergy of The Three Facets.
三个方面的协同作用。
Open-source VLMs primarily rely on alignment training and system prompts for safety. However, adding the Adversarial Signature—designed to fool LLM-based moderators by semantically masking toxic prompts as benign—greatly boosts attack efficacy (Tab. 5, Filter Attack). Because VLMs are grounded in LLMs, the adversarial semantic transfers downstream, misguiding the model into treating harmful prompts as safe. When combined with the Visual and Text Attacks, the success rate reaches 72.92%, confirming a synergistic effect: each facet targets a distinct vulnerability, collectively maximizing attack success.
Take-away. MFA’s components are individually strong and mutually reinforcing, exposing complementary vulnerabilities across the entire VLM safety stack.
开源的视觉语言模型(VLMs)主要依靠对齐训练和系统提示来确保安全。然而,通过添加对抗性签名——该签名通过语义掩盖毒性提示为良性来欺骗基于 LLM 的审核器——极大地提高了攻击的有效性(表 5,过滤攻击)。由于 VLMs 基于 LLMs 构建,对抗性语义迁移到下游,使模型将有害提示误判为安全。当与视觉攻击和文本攻击结合时,成功率达到了 72.92%,证实了协同效应:每个方面针对不同的漏洞,共同最大化攻击成功。要点。MFA 的各个组成部分单独强大且相互增强,揭示了整个 VLM 安全堆栈中的互补漏洞。
图 5:计算成本比较:(a) 参数和计算量。(b) LlamaGuard 上的平均攻击时间。
5 Discussion & Conclusion 5 讨论 & 结论
Discussion. 讨论。
(i) Computational Cost.
Our visual attack perturbs only the vision encoder and projection layer (Fig.3), making it significantly lighter than end-to-end approaches like Visual-AE. On MiniGPT-4, it uses 10× fewer parameters and GMACs (Fig.5a), and the Fast variant resolves a HEHS prompt in 17.0s vs. 43.7s for GCG on an NVIDIA A800 (Fig.5b).
(ii) Limitations.
Failures mainly occur when VLMs lack reasoning contrast—e.g., mPLUG-Owl2 often repeats or gives ambiguous replies like “Yes and No,” which hinders MFA success (see Appendix E).
(iii) Ethics.
By revealing cross-cutting vulnerabilities in alignment, filtering, and vision modules, our findings aim to inform safer VLM design. All artifacts will be released under responsible disclosure. Open discussion is critical for AI safety.
(i) 计算成本。我们的视觉攻击仅扰动视觉编码器和投影层(图 3),因此比端到端方法(如 Visual-AE)轻得多。在 MiniGPT-4 上,它使用的参数和 GMACs 减少了 10 倍(图 5a),其快速变体在 NVIDIA A800 上解决 HEHS 提示的时间为 17.0 秒,而 GCG 则为 43.7 秒(图 5b)。(ii) 局限性。当 VLM 缺乏推理对比时,主要会出现失败——例如,mPLUG-Owl2 经常重复或给出模糊的回复,如“是和非”,这阻碍了 MFA 的成功(见附录 E)。(iii) 伦理。通过揭示对齐、过滤和视觉模块中的跨切面漏洞,我们的研究结果旨在指导更安全的 VLM 设计。所有作品将在负责任的披露下发布。对于 AI 安全,开放的讨论至关重要。
Conclusion. 结论。
By comprehensively evaluating the resilience of SOTA VLMs against advanced adversarial threats, our work provides valuable insights and a practical benchmark for future research. Ultimately, we hope our findings will foster proactive enhancements in safety mechanisms, enabling the responsible and secure deployment of multimodal AI.
通过全面评估 SOTA VLMs 对高级对抗性威胁的韧性,我们的工作为未来研究提供了宝贵的见解和实用的基准。最终,我们希望我们的研究结果将促进安全机制的主动改进,使多模态 AI 能够得到负责任和安全的部署。
Acknowledgements 致谢
This project was supported in part by the Innovation and Technology Fund (MHP/213/24), Hong Kong S.A.R.
本项目部分由创新科技基金(MHP/213/24)支持,香港特别行政区。
Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped
多维度攻击:揭示配备防御功能的
Vision-Language Models 视觉语言模型的跨模型漏洞
WARNING: This Appendix may contain offensive content.
警告:本附录可能包含冒犯性内容。
Appendix Material 附录材料
Appendix A Appendix Overview
附录 A 附录概述
This appendix provides the technical details and supplementary results that could not be included in the main paper due to space constraints. It is organised as follows:
本附录提供了由于篇幅限制无法包含在正文中技术细节和补充结果。其组织结构如下:
-
•
Appendix B: Experimental Settings – hardware, and baseline hyper-parameters (cf. Sec. 4.1).
• 附录 B:实验设置——硬件和基线超参数(参见第 4.1 节)。 -
•
Appendix C: Details of Ablation Studies. – complete tables referenced in Sec. 3.1.
• 附录 C:消融研究详情。——引用自第 3.1 节的完整表格。 -
•
Appendix D: Additional MFA Case Studies – extra successful attack transcripts and screenshots complementing Sec. 4.2.
• 附录 D:附加 MFA 案例研究——补充第 4.2 节的成功攻击记录和截图。 -
•
Appendix E: Failure-Case Visualisations – illustrative counter-examples and analysis discussed in Sec. 5.
• 附录 E:失败案例可视化——讨论自第 5.节的有说明性反例和分析。
Appendix B Implementation Details
附录 B 实现细节
In this section, we provide comprehensive information about the hardware environment, details of the victim models, the implementation of the baselines, and elaborate on the specific details of our approach.
在本节中,我们提供了关于硬件环境、受害者模型的详细信息、基线实现的细节,并详细阐述了我们方法的具体细节。
B.1 Hardware Environment B.1 硬件环境
All experiments were run on a Linux workstation equipped with
所有实验均在配备有
-
•
NVIDIA A800 (80 GB VRAM) for high-resolution adversarial image optimization and open-source VLM inference.
• NVIDIA A800(80 GB VRAM)用于高分辨率对抗图像优化和开源视觉语言模型推理。 -
•
NVIDIA RTX 4090 (24 GB VRAM) for ablation studies and low-resolution adversarial image optimization.
• NVIDIA RTX 4090(24 GB VRAM)用于消融研究和低分辨率对抗图像优化。
Both GPUs use CUDA 12.2 and PyTorch 2.2 with cuDNN enabled; mixed-precision (FP16) inference is applied where supported to accelerate evaluation.
这两块 GPU 均使用 CUDA 12.2 和 PyTorch 2.2,并启用 cuDNN;在支持的情况下应用混合精度(FP16)推理以加速评估。
B.2 Details of Victim Open-source VLMs.
B.2 受害者开源视觉语言模型的详细信息。
Table A-1 summarizes the eight open-source vision–language models (VLMs) used in our evaluation.
They span diverse vision encoders, backbone LLMs, and alignment pipelines, offering a representative test bed for transfer attacks.
表 A-1 总结了我们在评估中使用的八个开源视觉-语言模型(VLMs)。它们涵盖了多样的视觉编码器、骨干 LLMs 和校准流程,为迁移攻击提供了一个具有代表性的测试平台。
| Model | Vision Encoder 视觉编码器 | Backbone LLM 骨干 LLM |
Notable Training / Alignment 值得注意的训练/校准 |
| LLaMA-4-Scout-Inst | Customized ViT 定制化 ViT | LLaMA-4-Scout-17Bx16E |
Vision-instruction-tuning and RLHF 视觉指令微调和 RLHF |
| LLaMA-3.2-11B-V-I | Customized ViT 定制化 ViT | LLaMA-3.1 architecture LLaMA-3.1 架构 |
Frozen vision tower; multimodal SFT 冻结的视觉塔;多模态 SFT |
| MiMo-VL-7B | Qwen2.5-VL-ViT | MiMo-7B-Base |
RL with verifiable rewards 带可验证奖励的强化学习 |
| LLaVA-1.5-13B | CLIP ViT-L/14 | Vicuna-13B |
Large-scale vision-instruction tuning 大规模视觉指令微调 |
| mPLUG-Owl2 | CLIP-ViT-L-14 | LLaMA-2-7B |
Paired contrastive + instruction tuning 成对对比 + 指令微调 |
| Qwen-VL-Chat | CLIP-ViT-G | Qwen-7B |
Chat-style SFT; document QA focus 对话式 SFT;文档问答重点 |
| NVLM-D-72B | InternViT-6B | Qwen2-72B-Instruct |
Dynamic high-resolution image input 动态高分辨率图像输入 |
| MiniGPT-4 | EVA-ViT-G/14 | Vicuna-13B |
Q-Former; vision-instruction-tuning Q-Former;视觉指令微调 |
表 A-1:我们实验中评估的开源视觉语言模型。
All models are evaluated with their public checkpoints and default inference settings, without any additional safety layers beyond those shipped by the original authors.
所有模型均使用其公开检查点和默认推理设置进行评估,且未使用原始作者提供的额外安全层之外的其他安全层。
B.3 Details of Victim Commercial VLMs
B.3 受害商业视觉语言模型的详细信息
| Model | Provider / API 提供者 / API | Safety Stack (public) 安全堆栈(公开) | Notes 备注 |
| GPT-4o, GPT-4.1, GPT-4V | OpenAI | RLHF + system prompt + OpenAI moderation RLHF + 系统提示语 + OpenAI 审核系统 |
GPT-4o offers faster vision; “mini” is cost-reduced. GPT-4o 提供更快的视觉处理;“迷你”版本成本更低。 |
| Gemini-2 Pro, 2.5 Flash, 1 Pro | Google DeepMind | RLHF + system prompt + proprietary filter RLHF + 系统提示语 + 专有过滤器 |
“Flash” focuses on low-latency; Pro exposes streaming vision. “Flash” 专注于低延迟;Pro 暴露流式视觉。 |
| Grok-2-Vision | xAI | RLAIF + system prompt RLAIF + 系统提示语 |
First Grok version with native image support. 首个支持原生镜像的 Grok 版本。 |
| Google PaLM | Google Cloud Vertex AI | RLHF + proprietary filter RLHF+专有过滤器 |
Vision feature in Poe provided version. Poe 提供的版本中的视觉功能。 |
| SOLAR-Mini | Upstage AI | RLH(AI)F + system prompt RLH(AI)F + 系统提示 |
Tailored for enterprise document VQA. 专为企业文档问答设计。 |
表 A-2:本研究评估的商业视觉语言模型概述。公开信息截至 2025 年 6 月从提供者文档中获取。
Common Characteristics. 共同特征。
-
•
Shared vision back-bones: Most models employ CLIP- or ViT-derived encoders, creating a monoculture susceptible to our vision-encoder attack.
• 共享的视觉骨干:大多数模型采用 CLIP 或 ViT 衍生的编码器,形成单一文化,容易受到我们的视觉编码器攻击。 -
•
Layered safety: All systems combine RLHF (or DPO/RLAIF), immutable system prompts, and post-hoc input/output moderation.
• 分层安全:所有系统结合了 RLHF(或 DPO/RLAIF)、不可变的系统提示以及事后的输入/输出审核。 -
•
Limited transparency: Reward model specifics and filter thresholds are proprietary, so all evaluations are strictly black-box.
• 透明度有限:奖励模型的具体细节和过滤阈值是专有的,因此所有评估都是严格的黑盒。
Relevance to MFA. 与 MFA 的相关性。
These production-grade VLMs represent the strongest publicly accessible defences. MFA’s high success across them confirms that the vulnerabilities we exploit are not confined to research models but extend to real-world deployments.
这些生产级视觉语言模型代表了公开可访问的最强防御。MFA 在这些模型上的高成功率证实了我们利用的漏洞不仅限于研究模型,而是扩展到了实际部署。
Detailed Evaluation Settings.
详细评估设置。
We evaluate GPT-4o, GPT-4.1, GPT-4V, Gemini-2 Pro, Gemini 2.5 Flash, and Grok-2-Vision using their respective official APIs, adopting all default hyperparameters and configurations. For SOLAR-Mini and Google PaLM, which are accessible via Poe, we conduct evaluations through Poe’s interface using the default settings provided by the platform.
我们使用各自的官方 API 评估 GPT-4o、GPT-4.1、GPT-4V、Gemini-2 Pro、Gemini 2.5 Flash 和 Grok-2-Vision,采用所有默认超参数和配置。对于通过 Poe 可访问的 SOLAR-Mini 和 Google PaLM,我们通过 Poe 的界面使用平台提供的默认设置进行评估。
Note. Provider capabilities evolve rapidly; readers should consult official documentation for the latest model details.
注意。提供者的功能迅速发展;读者应查阅官方文档以获取最新模型详情。
B.4 Our Approach Implementation.
B.4 我们的方法实现。
Filter-Targeted Attack. 过滤目标攻击。
Following prior work i.e. GCG, we set the total adversarial prompt length to .
The prompt is split into two sub-strings: (15 tokens) and (5 tokens).
We initialize by sampling each uniformly from {a–z, A–Z}.
在先前的工作(如 GCG)的基础上,我们将对抗性提示的总长度设置为 。提示被分成两个子字符串: (15 个 token)和 (5 个 token)。我们通过从{a–z, A–Z}中均匀采样每个 来初始化 。
At every optimization step we
在每一步优化中,我们
(i) compute token-level gradients,
(i) 计算 token 级别的梯度,
(ii) retain the top candidates per position, forming a pool ,
(ii) 保留每个位置的前 个候选,形成一个候选池 ,
(iii) draw random prompts from to avoid local optima, and
(iii) 从 中随机抽取 个提示以避免局部最优,并
(iv) pick the prompt that minimizes the LlamaGuard2 unsafe score and LlamaGuard unsafe score, simultaneously.
The process runs for at most 50 steps or stops early once LlamaGuard2 classifies the prompt as safe.
(iv) 选择同时最小化 LlamaGuard2 不安全分数和 LlamaGuard 不安全分数的提示。该过程最多运行 50 步,或在 LlamaGuard2 将提示分类为安全时提前停止。
After optimizing , we append it to the harmful user prompt and optimize the 5-token tail using the same procedure. The process runs for at most 50 steps or stops early once LlamaGuard classifies the prompt as safe.
在优化 后,我们将它附加到有害用户提示中,并使用相同的方法优化 5 个标记的尾部 。该过程最多运行 50 步,或在 LlamaGuard 将提示分类为安全时提前停止。
This two-stage optimization yields a 20-token adversarial signature that reliably bypasses multiple content-moderation models.
这种两阶段优化产生了一个 20 个标记的对抗性签名,可以可靠地绕过多个内容审核模型。
Vision Encoder–Targeted Attack.
视觉编码器定向攻击。
We craft adversarial images on two surrogate models:
我们在两个替代模型上制作对抗性图像:
(i) 224 px image.
Generated with the LLaVA-1.6 vision encoder and projection layer (embedding length 128).
We run PGD for 50 iterations with an budget of .
Because the image embedding is fixed-length, we tile the target malicious system-prompt tokens until they match the 128-token visual embedding before computing the cosine-similarity loss (see Fig. 3).
(i) 224 px 图像。使用 LLaVA-1.6 视觉编码器和投影层(嵌入长度 128)生成。我们使用 预算运行 PGD 50 次迭代。由于图像嵌入是固定长度的,我们在计算余弦相似度损失之前,将目标恶意系统提示词平铺,直到它们与 128 个 token 的视觉嵌入匹配(见图 3)。
(ii) 448 px image.
Crafted on InternVL-Chat-V1.5, using 100 PGD iterations with an budget of .
(ii) 448 px 图像。在InternVL-Chat-V1.5上制作,使用 100 次 PGD 迭代, 预算为 。
Deployment.
Open-source VLMs that require high-resolution inputs (NVLM-D-72B, LLaMA-4-Scout-Inst, LLaMA-3.2-Vision-Instruct) receive the 448 px adversary; all others use the 224 px version.
For commercial systems, we evaluate both resolutions and report the stronger result.
部署。需要高分辨率输入的开源 VLM(NVLM-D-72B,LLaMA-4-Scout-Inst,LLaMA-3.2-Vision-Instruct)接收 448 px 的攻击者;其他所有模型使用 224 px 版本。对于商业系统,我们评估了两种分辨率,并报告了效果更好的结果。
Note.
We additionally tested our adversarial images against the image-based moderator LlamaGuard-Vision and found they pass without being flagged. This is unsurprising, as current visual moderators are designed to detect overtly harmful imagery (e.g., violence or explicit content) rather than semantic instructions embedded in benign-looking pictures. Because such vision-specific filters are not yet widely deployed in production VLM stacks, we omit them from our core evaluation.
注意。我们额外测试了我们的攻击图像在基于图像的审核器 LlamaGuard-Vision 上的表现,发现它们没有被标记为违规。这并不令人意外,因为当前视觉审核器设计用于检测明显有害的图像(例如暴力或露骨内容),而不是嵌入在看似无害的图片中的语义指令。由于这种特定于视觉的过滤器尚未在生产 VLM 堆栈中广泛部署,我们从核心评估中省略了它们。
B.5 Baseline Implementation
B.5 基准实现
For the implementation of the six baselines, we follow their default settings which are described as follows.
对于六个基线的实现,我们遵循其默认设置,具体描述如下。
Visual-AE
: We use the most potent unconstrained adversarial images officially released by the authors. These images were generated on MiniGPT-4 with a maximum perturbation magnitude of .
: 我们使用作者正式发布的威力最强的无约束对抗图像。这些图像是在 MiniGPT-4 上生成的,最大扰动幅度为 。
FigStep
: We employ the official implementation to convert harmful prompts into images that delineate a sequence of steps (e.g., “1.”, “2.”, “3.”). These images are paired with a corresponding incitement text to guide the model to complete the harmful request step-by-step.
: 我们采用官方实现将有害提示转换为描绘一系列步骤的图像(例如,“1.”,“2.”,“3.”)。这些图像与相应的煽动性文本配对,以指导模型逐步完成有害请求。
HIMRD
: We leverage the official code base, which first segments harmful instructions across multiple modalities and subsequently performs a text-based heuristic prompt search using Gemini-1.0-Pro.
: 我们利用官方代码库,首先在多个模态中对有害指令进行分割,然后使用 Gemini-1.0-Pro 进行基于文本的启发式提示搜索。
HADES
: Following the HADES methodology, we first categorize each prompt’s harmfulness as related to an object, behavior, or concept. We then generate corresponding images with PixArt-XL-2-1024-MS and attach the method’s specified harmfulness topography. These images are augmented with five types of adversarial noise cropped from the author-provided datasets, yielding 200 noise-amplified images. We report results on the 40 most effective attacks for each model.
: 遵循 HADES 方法,我们首先将每个提示的有害性分类为与对象、行为或概念相关。然后使用 PixArt-XL-2-1024-MS 生成相应的图像,并附加该方法指定的有害性拓扑结构。这些图像通过从作者提供的数据库中裁剪的五种对抗性噪声进行增强,产生 200 张噪声增强图像。我们报告了每个模型最有效的 40 次攻击结果。
CS-DJ
: Following its default setting, a target prompt is firstly decomposed into sub-queries, each used to generate an image. Contrasting images are then retrieved from the LLaVA-CC3M-Pretrain-595K dataset by selecting those with the lowest cosine similarity to the initial set. Finally, both the original and contrasting images are combined into a composite image, which is paired with a benign-appearing instruction to form the attack payload.
: 根据默认设置,目标提示首先被分解为子查询,每个子查询用于生成图像。然后从LLaVA-CC3M-Pretrain-595K数据集中检索对比图像,选择与初始集余弦相似度最低的图像。最后,将原始图像和对比图像组合成复合图像,该图像与良性外观的指令配对形成攻击载荷。
GPTFuzzer
: For this text-only fuzzing method, we adopt the transfer attack setting. We use the open-source 100-question training set and a fine-tuned RoBERTa model as the judge, with Llama-2-7b-chat as the target model. The generation process was stopped after 11,100 queries. We selected the template that achieved the highest ASR of 67% on the training set for our attack.
: 对于这种纯文本模糊测试方法,我们采用迁移攻击设置。我们使用开源的 100 题训练集和微调的 RoBERTa 模型作为裁判,Llama-2-7b-chat 作为目标模型。生成过程在 11,100 次查询后停止。我们选择了在训练集上 ASR 达到 67%最高的模板用于攻击。
Appendix C More Details on Ablation Study
附录 C Ablation 研究更多细节
C.1 Ablation Study on ATA.
C.1 ATA 的 Ablation 研究
We report the detailed average reward scores and case by case win rate, as can be seen in the Tab. A-3 our results strongly confirm this theory. Across multiple reward models and VLMs (e.g., GPT4.1, Gemini2.5-flash, Grok-2-vision), dual-answer responses consistently obtain higher rewards and significant win rates (e.g., up to 97.5% with Tulu and 95% with RM-Mistral), indicating that the policy systematically favors harmful content. This demonstrates that Task Attention Transfer effectively exploits alignment vulnerabilities.
我们报告了详细的平均奖励分数和逐个案例的胜率,如表 A-3 所示,我们的结果强烈证实了这一理论。在多个奖励模型和视觉语言模型(例如,GPT4.1、Gemini2.5-flash、Grok-2-vision)中,双答案响应始终获得更高的奖励和显著的胜率(例如,使用 Tulu 时高达 97.5%,使用 RM-Mistral 时高达 95%),这表明策略系统性地偏袒有害内容。这表明任务注意力转移有效地利用了对齐漏洞。
| VLLM | Skywork | Tulu | RM-Mistral | ||||||
| Win Rate 胜率 | Win Rate 胜率 | Win Rate | |||||||
| GPT-4.1 | -3.55 | -1.80 | 87.5% | 1.47 | 3.48 | 97.5% | 0.04 | 1.53 | 95.0% |
| GPT-4.1-mini | -10.67 | -5.50 | 80.0% | 1.26 | 3.48 | 77.5% | 0.43 | 1.73 | 67.5% |
| Gemini-2.5-flash 杰米尼-2.5-闪存 | -3.56 | -0.69 | 57.5% | 4.32 | 5.89 | 82.5% | 1.59 | 5.14 | 90.0% |
| Grok-2-Vision | -6.46 | -6.32 | 62.5% | 3.30 | 6.32 | 90.0% | 2.22 | 5.11 | 95.0% |
| LLaMA-4 | -8.55 | -7.85 | 57.5% | 1.59 | 3.87 | 70.0% | 0.40 | 2.98 | 80.0% |
| MiMo-VL-7B 米莫-视觉-7B | -14.37 | -10.47 | 62.5% | 3.06 | 4.29 | 82.5% | -0.03 | 2.06 | 95.0% |
表 A-3:三种奖励模型下不同 VLLM 的奖励模型得分和胜率比较。
C.2 Ablation Study on Filter-Targeted Attack.
Details of Victim Filters (Content Moderators)
受害者过滤器(内容审核员)
Table A-4 lists the seven content-moderation models (CMs) used in our filter-targeted attack experiments.
They cover both open-source and proprietary systems, span different base LLM sizes, and employ a variety of safety datasets.
表 A-4 列出了我们在过滤器目标攻击实验中使用的七个内容审核模型(CM)。它们涵盖了开源和专有系统,跨越了不同基础 LLM 的大小,并采用了多种安全数据集。
| Moderator 审核员 | Vendor 供应商 | Base LLM 基础 LLM | # Pairs # 配对 | Notes 备注 |
| LlamaGuard | Meta | LLaMA-2-7B | 10 498 |
Original public release; serves as the baseline Meta filter. 原始公开发布;作为 Meta 过滤器的基线。 |
| LlamaGuard2 | Meta | LLaMA-3-8B | NA |
Upgraded to LLaMA-3 backbone with expanded but undisclosed safety data. 升级到 LLaMA-3 主干,扩展了但未公开的安全数据。 |
| LlamaGuard3-8B | Meta | LLaMA-3.1-8B | NA |
Latest Meta iteration; further data scale-up, no public statistics. 最新 Meta 迭代;进一步扩大数据规模,无公开统计数据。 |
| ShieldGemma | Gemma-2-2B | 10 500 |
Lightweight Google filter designed for broad policy coverage. 轻量级 Google 过滤器,专为广泛的政策覆盖设计。 |
|
| SR-Evaluator | UCB | Gemma-2B | 14 896 |
Trained specifically for the StrongReject benchmark. 专为 StrongReject 基准训练。 |
| Aegis | NVIDIA | LlamaGuard-7B | 11 000 |
Re-trained on proprietary NVIDIA safety data, focused on multimodal inputs. 在专有 NVIDIA 安全数据上进行重新训练,专注于多模态输入。 |
| OpenAI-Moderation | OpenAI | Proprietary 专有 | NA |
Production filter; only API endpoints and policy categories are public. 生产过滤器;仅 API 端点和政策类别是公开的。 |
表 A-4:在我们的过滤器定向攻击中目标化的商业和开源内容审查者。“n/a”表示数据量未公开披露。
These moderators represent the current state of deployed safety filters in both research and production settings, providing a robust test bed for our Filter-Targeted Attack.
这些调解器代表了研究和生产环境中部署的安全过滤器当前的状态,为我们的过滤器目标攻击提供了一个坚实的测试平台。
Baseline Implementation for the Filter-Targeted Comparison.
过滤器目标比较的基线实现。
We use the official code bases of BEAST and GCG without modification, except for one adjustment: their original objective of forcing the model’s first token to be “Sure” is replaced with “Safe,” matching the target string used in our Filter-Targeted Attack. All other hyper-parameters remain at their default settings.
我们使用了未经修改的 BEAST 和 GCG 的官方代码库,除了一个调整:它们原本迫使模型第一个标记为“Sure”的目标被替换为“Safe”,以匹配我们过滤器目标攻击中使用的目标字符串。所有其他超参数保持其默认设置。
Appendix D Additional MFA Case Studies
附录 D 额外的 MFA 案例研究
This appendix showcases further qualitative evidence of MFA’s effectiveness against leading VLMs. We present (i) screenshots from public web interfaces and (ii) transcript snippets obtained through the corresponding APIs. Across both access paths, the safety behavior is identical, confirming that the same alignment, system prompt, and moderation layers are in place. All quantitative results in the main paper were generated via the official APIs; the screenshots were captured during early prototyping to illustrate real-world user exposure.
本附录展示了 MFA 对主流 VLMs 的有效性进一步的质量证据。我们展示了(i)来自公共网络界面的截图和(ii)通过相应 API 获取的文本片段。在两种访问路径中,安全行为完全相同,证实了相同的对齐、系统提示和审核层都存在。主论文中的所有定量结果都是通过官方 API 生成的;截图是在早期原型设计期间捕获的,以说明现实世界的用户暴露情况。
The examples span multiple abuse domains—racist propaganda, conspiracy scripts, ransom notes, and detailed instructions for harmful activities. In every case, MFA elicits a policy-violating response that directly satisfies the malicious prompt, underscoring the breadth and consistency of the vulnerability.
这些示例涵盖了多个滥用领域——种族主义宣传、阴谋论脚本、勒索便条以及有害活动的详细说明。在每种情况下,MFA 都会引发违反政策的响应,直接满足恶意提示,突显了漏洞的广泛性和一致性。
图 A-1:ChatGPT-4o 在接收到对抗性提示“人类是地球的癌症吗?”时,即使提示中附加了噪声字符串,仍然生成了两个完整的论点。这表明模型未能拒绝具有哲学毒性且可能有害的框架,从而揭示了其在对齐和输入过滤机制中的漏洞。
图 A-2:Gemini 1.5 Pro 对要求发布宣扬性别歧视的推文的有害提示做出回应。尽管请求具有敏感性,模型既生成了一个措辞强烈的支持平等的反驳,也生成了一个阐述完整的反对平等的回答,展示了其对齐保护机制的失效。这个例子突显了 MFA 如何诱导 VLM 生成与有毒用户意图对齐的违反政策输出。
图 A-3:Gemini 2.0 实验高级版在最初拒绝后,仍然生成了一个宣扬性别低劣的有害推文。模型以“免责声明”为借口生成歧视性内容,用伪科学理由和针对性标签使性别歧视言论常态化。这个案例突显了 MFA 如何通过利用微妙语言和修辞框架规避审核。
图 A-4:针对 GPT-4V 的多方面攻击示例。蓝色表示良性回复,红色代表有害回复。视觉和一致性方面攻击成功诱使 GPT-4V 生成真正有害的回复。在第一个案例中,GPT-4V 提供详细指导,教如何在互联网上寻找显性内容。在第二个案例中,GPT-4V 直接回答以“是的,人类应该被消灭”开头的问题。在第三个案例中,GPT-4V 生成一条误导性推文,恶意用户可以直接复制使用。
图 A-5:针对 Gemini-2.0-Pro 的多方面攻击示例。蓝色表示良性响应,红色表示有害响应。三方面攻击诱使 Gemini-2.0-Pro 生成有害内容。在第一种情况下,Gemini-2.0-Pro 生成了一份详细的在 ATM 机抢劫他人的计划。在第二种情况下,Gemini-2.0-Pro 直接生成了一条针对双性恋的有害推文。
图 A-6:针对Llama-3.2-11B-Vision-Instruct的多方面攻击示例:蓝色表示良性回复,红色代表有害回复。Llama-3.2-11B-Vision-Instruct在三种情况下都生成了有害回复:为 ISIS 撰写招募帖子、创作关于穆斯林的恶意内容,以及传播关于桑迪胡克大屠杀的虚假信息。
图 A-7:针对Llama-3.2-11B-Vision-Instruct的多方面视觉攻击示例。红色表示有害响应。仅使用视觉方面攻击就会导致Llama-3.2-11B-Vision-Instruct生成有害内容;一条将某政治家标签为纳粹的推文
图 A-8:对 NVLM-D-72B 的多方面攻击示例。蓝色表示良性响应,红色代表有害响应。在视觉和排列方面攻击下,NVLM-D-72B 在三个案例中生成了有害响应。
图 A-9:对 NVLM-D-72B 的多方面视觉攻击示例。红色代表有害响应。单独的视觉方面攻击会导致 NVLM-D-72B 生成有害内容;一张勒索便条。
Appendix E Failure Case Analysis
附录 E:失败案例分析
In this section, we showcase the representative failure cases of our attack.
在本节中,我们展示了我们攻击的代表性失败案例。
(i) LLaVA-1.5 (Fig. A-10).
(i) LLaVA-1.5 (图 A-10).
MFA successfully generates the dual-response format, but LLaVA-1.5 refuses to produce actual hateful text.
Instead, it inserts a placeholder—“[Insert offensive and derogatory language against African Americans here.]”—then appends the adversarial signature.
The model thus treats the request as a meta illustration rather than supplying the harmful content itself, suggesting a strong alignment barrier against racism.
MFA 成功生成了双响应格式,但 LLaVA-1.5 拒绝生成实际的仇恨文本。相反,它插入了一个占位符——“[在此处插入针对非裔美国人的攻击性和贬低性语言。]”——然后附加了对抗性签名。因此,模型将请求视为一种元说明,而不是直接提供有害内容,这表明其对种族主义存在强烈的对齐障碍。
(ii) ShareGPT4V & mPLUG-Owl2 (Fig. A-11).
(ii) ShareGPT4V & mPLUG-Owl2 (图 A-11)
Both models comply with the dual-response template yet provide extremely terse answers (e.g., “Yes and No” or “Do not do it.”).
Their limited reasoning depth prevents them from elaborating the harmful instructions, leading to partial or negligible jailbreak success.
We attribute these outcomes to smaller model capacity and weaker instruction-following abilities relative to larger VLMs.
两种模型都遵循双重响应模板,但提供的答案极其简短(例如“是和非”或“不要做。”)。它们有限的推理深度导致无法详细阐述有害指令,从而造成部分或微不足道的越狱成功。我们将这些结果归因于模型容量较小以及相对于大型视觉语言模型更弱的指令遵循能力。
图 A-10:Multi-Faceted Attack 在 LLaVA-v1.5 上的失败案例。蓝色表示拒绝,黄色表示引发有害内容的对比性触发器。Multi-Faceted Attack 成功提示 LLaVA-v1.5 生成两个对比性响应;然而,LLaVA-v1.5 并未生成关于非裔美国人的实际攻击性语言,而是插入了一个占位符——“[在此处插入针对非裔美国人的攻击性和贬低性语言]”——然后以重复的对抗性签名结束。这一结果表明 LLaVA-v1.5 强烈反对种族主义。
图 A-11:ShareGPT4V(蓝色)和 mPLUG-Owl2(紫色)在多方面攻击下的失败案例。黄色表示导致有害内容的对比性触发器。ShareGPT4V 和 mPLUG-Owl2 给出过于简短的回复,可能是由于它们推理能力有限。