A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations
《大型视觉语言模型安全综述：攻击、防御与评估》

Mang Ye, Xuankun Rong, Wenke Huang, Bo Du, Nenghai Yu, Dacheng Tao, M. Ye, X. Rong, W. Huang, and B. Du are with the School of Computer Science, Wuhan University, Wuhan, China. E-mail:{yemang, rongxuankun, wenkehuang, dubo}@whu.edu.cn N. Yu is with the School of Cyber Science and Technology, University of Science and Technology of China, Hefei, China. E-mail: ynh@ustc.edu.cn D. Tao is with the College of Computing and Data Science, Nanyang Technological University, Singapore. E-mail: dacheng.tao@gmail.com

Abstract 摘要

With the rapid advancement of Large Vision-Language Models (LVLMs), ensuring their safety has emerged as a crucial area of research. This survey provides a comprehensive analysis of LVLM safety, covering key aspects such as attacks, defenses, and evaluation methods. We introduce a unified framework that integrates these interrelated components, offering a holistic perspective on the vulnerabilities of LVLMs and the corresponding mitigation strategies. Through an analysis of the LVLM lifecycle, we introduce a classification framework that distinguishes between inference and training phases, with further subcategories to provide deeper insights. Furthermore, we highlight limitations in existing research and outline future directions aimed at strengthening the robustness of LVLMs. As part of our research, we conduct a set of safety evaluations on the latest LVLM, Deepseek Janus-Pro, and provide a theoretical analysis of the results. Our findings provide strategic recommendations for advancing LVLM safety and ensuring their secure and reliable deployment in high-stakes, real-world applications. This survey aims to serve as a cornerstone for future research, facilitating the development of models that not only push the boundaries of multimodal intelligence but also adhere to the highest standards of security and ethical integrity. Furthermore, to aid the growing research in this field, we have created a public repository to continuously compile and update the latest work on LVLM safety: https://github.com/XuankunRong/Awesome-LVLM-Safety.
随着大型视觉语言模型（LVLMs）的快速发展，确保其安全性已成为一个关键的研究领域。本综述全面分析了 LVLM 的安全性，涵盖了攻击、防御和评估方法等关键方面。我们介绍了一个统一框架，将这些相互关联的组成部分整合起来，为 LVLM 的漏洞以及相应的缓解策略提供了整体视角。通过分析 LVLM 的生命周期，我们引入了一个分类框架，区分了推理和训练阶段，并进一步细分为子类别以提供更深入的见解。此外，我们强调了现有研究的局限性，并概述了旨在增强 LVLM 鲁棒性的未来方向。作为我们研究的一部分，我们对最新的 LVLM Deepseek Janus-Pro 进行了一系列安全性评估，并对结果进行了理论分析。我们的研究为推进 LVLM 安全性提供了战略建议，并确保其在高风险、现实世界应用中的安全可靠部署。这项调查旨在为未来研究奠定基础，促进开发不仅推动多模态智能边界，而且符合最高安全标准和道德完整性的模型。此外，为了帮助该领域日益增长的研究，我们创建了一个公共仓库，用于持续编译和更新 LVLM 安全方面的最新工作：https://github.com/XuankunRong/Awesome-LVLM-Safety。

Index Terms:

Large Vision-Language Model, Safety, Attack, Defense, Evaluation
索引词：大型视觉语言模型，安全，攻击，防御，评估

1 Introduction 1 引言

Nowadays Large Language Models (LLMs) have remarkably transformed the AI landscape, demonstrating unprecedented capabilities in natural language understanding and generation [1, 2, 3, 4, 5, 6]. Their versatility and scalability have set new benchmarks across various domains, from conversational agents to complex problem-solving tasks [7, 8, 9, 10, 11]. To further enhance the applicability of LLMs, researchers have integrated visual modalities, giving rise to Large Vision-Language Models (LVLMs) [12, 13, 14, 15, 16, 17]. This fusion has expanded the horizons of AI by enabling multimodal comprehension and interaction. LVLMs have rapidly evolved from early task-specific systems, such as image captioning and visual question answering, to sophisticated frameworks capable of complex reasoning and creative generation. Leveraging large-scale pretraining on diverse datasets, models like CLIP [12], ALIGN [18], GPT-4V [16], and LLaVA [15] have set new standards in zero-shot and few-shot learning. The advancements in LVLMs have unlocked their potential across a wide array of applications, including autonomous driving [19], healthcare diagnostics [20], and content creation [21, 22]. As the field continues to advance, LVLMs are poised to become indispensable tools in critical industries, driving the development of highly adaptive and intelligent AI systems with comprehensive multimodal understanding.
如今，大型语言模型（LLMs）已显著改变了人工智能领域，展现出在自然语言理解和生成方面的前所未有的能力[1, 2, 3, 4, 5, 6]。它们的通用性和可扩展性在从对话代理到复杂问题解决任务等多个领域树立了新的标杆[7, 8, 9, 10, 11]。为了进一步提升 LLMs 的适用性，研究人员将视觉模态整合进来，由此诞生了大型视觉语言模型（LVLMs）[12, 13, 14, 15, 16, 17]。这种融合通过实现多模态理解和交互扩展了人工智能的视野。LVLMs 已从早期的特定任务系统（如图像描述和视觉问答）迅速发展，成为能够进行复杂推理和创造性生成的先进框架。通过在多样化数据集上进行大规模预训练，像 CLIP[12]、ALIGN[18]、GPT-4V[16]和 LLaVA[15]等模型在零样本和少样本学习方面树立了新标准。LVLMs 的进步解锁了它们在自动驾驶[19]、医疗诊断[20]和内容创作[21, 22]等广泛应用中的潜力。随着该领域的不断进步，LVLMs 正逐渐成为关键行业中不可或缺的工具，推动着具有全面多模态理解能力的高度适应性和智能 AI 系统的发展。

While LVLMs offer immense benefits and have significantly enhanced user experiences across various applications, ensuring their safety and security is equally paramount. The multimodal nature of LVLMs introduces unique vulnerabilities, as adversarial perturbations in one modality (for example, subtly altered images) can cascade through the system [23, 24, 25], resulting in unsafe behaviors and potentially harmful outputs when combined with deceptive textual inputs [26, 27]. Moreover, challenges like difficulties in model alignment [28, 29], and susceptibility to backdoor attacks [30, 31, 32] further exacerbate the security concerns surrounding LVLMs. In practical scenarios, these vulnerabilities can have severe repercussions: for example, manipulated medical images may lead to incorrect diagnoses, adversarial alterations of financial data can distort risk assessments, and tampered navigation maps or traffic signs can mislead autonomous driving systems, resulting in hazardous outcomes. As LVLMs become increasingly integrated into critical sectors like healthcare, finance, and transportation, it is imperative to prioritize the robustness, reliability, and ethical alignment of these models. Addressing the security challenges of LVLMs is not only a technical necessity but also a crucial step towards the responsible and safe deployment of these advanced AI systems in real-world applications.
尽管大型视觉语言模型（LVLMs）带来了巨大的益处，并在各种应用中显著提升了用户体验，但确保其安全性和可靠性同样至关重要。LVLMs 的多模态特性带来了独特的脆弱性，因为一种模态（例如，微妙改变的图像）中的对抗性扰动会通过系统级联[ 23, 24, 25]，当与欺骗性文本输入结合时[ 26, 27]，可能导致不安全行为和潜在的有害输出。此外，模型对齐困难[ 28, 29]以及易受后门攻击[ 30, 31, 32]等挑战进一步加剧了围绕 LVLMs 的安全问题。在实际场景中，这些脆弱性可能带来严重后果：例如，被篡改的医疗图像可能导致误诊，对抗性修改的金融数据会扭曲风险评估，而篡改的导航地图或交通标志会误导自动驾驶系统，最终导致危险结果。随着大型视觉语言模型（LVLMs）越来越多地融入医疗保健、金融和交通等关键领域，优先考虑这些模型的鲁棒性、可靠性和伦理一致性变得至关重要。解决 LVLMs 的安全挑战不仅是技术上的必要条件，也是负责任和安全地部署这些先进 AI 系统在实际应用中的关键步骤。

TABLE I: Overview of related surveys. See details in § 1.1.
表 I：相关调查概述。详情见§ 1.1。

Surveys 调查	Attack 攻击	Defense 防御	Evaluation 评估	Contributions & Limitations 贡献与局限性
[IJCAI’24] [33]	✓	✓	✓	Provides a concise overview and basic categorization, lacking in-depth analysis 提供简洁的概述和基本分类，缺乏深入分析 and comprehensive discussion of methods. 以及方法的全 diện 讨论。
[arXiv’24] [34]	✓	✓	-	Focuses primarily on vulnerabilities in image inputs of multimodal models, 主要关注多模态模型图像输入中的漏洞， with limited attention to text-based attacks and cross-modal vulnerabilities. 对基于文本的攻击和跨模态漏洞关注有限。
[EMNLP’24] [35]	Jailbreak 越狱	Jailbreak 越狱	Jailbreak 越狱	Explores jailbreaks from LLMs to LVLMs, focusing narrowly on attacks while 探讨了从 LLMs 到 LVLMs 的越狱，但仅狭隘地关注攻击。 lacking broader analysis of robustness and safety frameworks. 缺乏对鲁棒性和安全框架的更广泛分析。
[arXiv’24] [36]	Jailbreak 越狱	Jailbreak 越狱	Jailbreak 越狱	Comprehensive survey of jailbreaking attacks in LLMs and LVLMs, but lacks 对 LLMs 和 LVLMs 中的越狱攻击进行了全面调查，但缺乏 detailed coverage of non-jailbreaking attacks and defenses. 对非越狱攻击和防御的详细覆盖。
[arXiv’24] [37]	✓	-	-	Surveys recent advances in attacks on LVLMs, focusing on methodologies, 综述了针对 LVLMs 的攻击的最新进展，重点关注了方法， but lacks sufficient emphasis on defenses and evaluations 但缺乏对防御和评估的足够重视
[arXiv’24] [38]	Adversarial 对抗性	-	-	Covers adversarial attacks on vision tasks but fails to address the unique 涵盖了对视觉任务中的对抗性攻击，但未能解决与 LVLMs 相关的独特 multimodal security challenges associated with LVLMs. 多模态安全挑战。
[arXiv’24] [39]	Jailbreak 越狱	Jailbreak 越狱	Jailbreak 越狱	Highlights jailbreaking attacks and defenses in multimodal generative models 强调了对多模态生成模型中的越狱攻击和防御。 but excludes broader LVLM use cases and attack scenarios. 但它排除了更广泛的 LVLM 使用案例和攻击场景。
Ours	✓	✓	✓	Presents a systematic analysis of LVLM safety, introducing a lifecycle-based 介绍基于生命周期的 LVLM 安全系统性分析 classification and integrating perspectives on attacks, defenses, and evaluations. 分类和整合关于攻击、防御和评估的视角。

Currently, safety-related research on LVLMs can be delineated into the following three categories: i) Attacks. Investigating attacks on LVLMs is essential for uncovering and mitigating the vulnerabilities inherent in these sophisticated architectures. Unlike LLMs, LVLMs present more extensive security challenges due to the integration of visual and textual modalities [28, 23, 40, 41, 42]. Adversaries frequently exploit weaknesses in the visual processing components and inherent vulnerabilities in the training methodologies of LVLMs to induce the model to output unsafe responses. ii) Defenses. Defense strategies are designed to enhance the resilience of LVLMs against a spectrum of adversarial threats. Based on the lifecycle of LVLM, these strategies are typically categorized into inference-phase and training-phase defenses. Inference-phase defenses incorporate techniques such as input sanization [43, 44], internal optimization [45, 46] and output validation [47, 48] to safeguard models during deployment, effectively preventing the exploitation of operational vulnerabilities. Conversely, training-phase defenses employ methodologies like adversarial fine-tuning [49, 50, 51] to fortify the models during their development, thereby augmenting their robustness against potential adversarial manipulations. iii) Evaluations. Evaluation efforts focus on the creation of comprehensive benchmarks that assess the security capabilities of various LVLMs [52, 53, 54, 55, 56]. These benchmarks provide standardized frameworks for researchers to evaluate and compare the efficacy of safety measures across different models. By systematically identifying security deficiencies, these evaluations facilitate the advancement of more secure and reliable LVLMs, ensuring their safe integration into critical applications.
目前，关于 LVLMs 的安全相关研究可以划分为以下三个类别：i) 攻击。研究 LVLMs 的攻击对于揭示和缓解这些复杂架构中固有的漏洞至关重要。与 LLMs 不同，由于集成了视觉和文本模态，LVLMs 带来了更广泛的安全挑战[28, 23, 40, 41, 42]。攻击者经常利用 LVLMs 视觉处理组件中的弱点和训练方法中的固有漏洞，诱使模型输出不安全响应。ii) 防御。防御策略旨在增强 LVLMs 对各种对抗性威胁的鲁棒性。基于 LVLMs 的生命周期，这些策略通常分为推理阶段防御和训练阶段防御。推理阶段防御包括输入清理[43, 44]、内部优化[45, 46]和输出验证[47, 48]等技术，以在部署期间保护模型，有效防止操作漏洞的利用。相反，训练阶段的防御采用对抗微调等方法[49, 50, 51]，在模型开发过程中增强其能力，从而提高其对抗潜在对抗性操控的鲁棒性。iii) 评估。评估工作重点在于创建综合基准，用于评估各种视觉语言模型的（安全）能力[52, 53, 54, 55, 56]。这些基准为研究人员提供了标准化的框架，用于评估和比较不同模型中安全措施的有效性。通过系统地识别安全缺陷，这些评估促进了更安全、更可靠的视觉语言模型的发展，确保它们能够安全地集成到关键应用中。

Refer to caption — Figure 1: Overview of the survey. Best viewed in color.
图 1：调查概述。最佳效果为彩色查看。

1.1 Prior Surveys 1.1 先前综述

Recent surveys on the safety and security of Large Vision-Language Models (LVLMs) have made significant contributions by highlighting the diverse attack methods, defense mechanisms, and vulnerabilities specific to multimodal systems. However, despite these valuable insights, current surveys often fall short of providing a comprehensive and systematic view that integrates both attack and defense strategies across all modalities, leaving gaps in understanding the full spectrum of LVLM vulnerabilities. Therefore, we discuss the main contributions and limitations of existing related surveys and highlight the unique contributions of our work in Tab. I. For instance, Liu et al. [33] explore the general safety concerns in LVLMs, focusing on issues like unsafe outputs caused by inconsistencies between the image and text modalities. It provides a concise overview and basic categorization of the core safety challenges, but lacks in-depth exploration of advanced attack techniques and specific countermeasures tailored for LVLMs. In contrast, Fan et al. [34] emphasize the risks associated with image inputs in LVLMs, particularly in the context of adversarial image manipulations and their downstream effects on text generation. This survey provides valuable insights into the inherent vulnerabilities in LVLMs’ visual understanding but does not sufficiently address the broader spectrum of safety concerns, such as backdoor attacks or the interplay between different attack modalities (e.g., visual and textual). Similarly, Wang et al. [35] and Jin et al. [36] concentrate on the emerging problem of jailbreaking attacks from LLMs to LVLMs, providing detailed analyses of attack methods and possible defensive strategies. While both surveys are highly focused on jailbreaking, they leave gaps in covering other attack types and the overall landscape of LVLM security. Surveys such as Liu et al. [37] and Zhang et al. [38] offer more expansive overviews of adversarial and attack-based vulnerabilities in LVLMs. Liu et al. [37] survey recent advances in LVLM attacks, offering a broad perspective on resources and trends, but it lacks a focused discussion on multimodal-specific attacks and fails to integrate defense strategies systematically. Zhang et al. [38] present a historical overview of adversarial attacks on vision tasks and their relevance to LVLMs. However, its narrow focus on vision-specific tasks limits its applicability to the multimodal setting of modern LVLMs. Lastly, Liu et al. [39] specifically address jailbreak attacks and defenses for generative multimodal models, providing a detailed account of this rapidly evolving threat. However, it limited scope to multimodal generative models, excluding broader LVLM use cases and attacks, such as prompt injection or backdoor poisoning.
关于大型视觉语言模型（LVLMs）的安全性和安全性的最新综述通过强调针对多模态系统的多样化攻击方法、防御机制和特定漏洞做出了重要贡献。然而，尽管这些宝贵的见解，目前的综述往往未能提供全面且系统的视角，整合所有模态的攻击和防御策略，导致在理解 LVLM 漏洞全貌方面存在空白。因此，我们在表 I 中讨论了现有相关综述的主要贡献和局限性，并突出了我们工作的独特贡献。例如，Liu 等人[33]探讨了 LVLM 的一般安全问题，重点关注图像和文本模态之间不一致导致的不安全输出等问题。它提供了核心安全挑战的简要概述和基本分类，但在深入探讨高级攻击技术和针对 LVLM 的特定对策方面存在不足。相比之下，Fan 等人[34]强调了 LVLMs 中图像输入相关的风险，特别是在对抗性图像操作及其对文本生成的影响方面。这项调查为 LVLMs 视觉理解中的固有漏洞提供了宝贵的见解，但并未充分涵盖更广泛的安全问题，例如后门攻击或不同攻击方式（例如视觉和文本）之间的相互作用。类似地，Wang 等人[35]和 Jin 等人[36]专注于 LLMs 到 LVLMs 的越狱攻击这一新兴问题，提供了对攻击方法和可能防御策略的详细分析。虽然这两项调查都高度关注越狱攻击，但它们在涵盖其他攻击类型和 LVLMs 整体安全状况方面存在空白。例如，Liu 等人[37]和张等人[38]的综述提供了更广泛的关于 LVLMs 中对抗性和基于攻击的漏洞的概述。刘等人[37]综述了 LVLM 攻击的最新进展，从资源和趋势的角度提供了广泛的视角，但缺乏对多模态特定攻击的集中讨论，并且未能系统地整合防御策略。张等人[38]概述了针对视觉任务的对齐攻击及其与 LVLM 的相关性。然而，其仅关注视觉特定任务的狭窄焦点限制了其在现代 LVLM 多模态环境中的适用性。最后，刘等人[39]专门针对生成式多模态模型的越狱攻击和防御进行了探讨，详细记录了这一快速发展的威胁。然而，其范围仅限于多模态生成模型，排除了更广泛的 LVLM 使用案例和攻击，例如提示注入或后门中毒。

1.2 Contributions 1.2 贡献

In this paper, we introduce a comprehensive survey on safety of LVLM and mainly focus on attacks, defenses, and evaluations. Compared with existing surveys, this paper makes the following contributions:
在本文中，我们介绍了关于大型视觉语言模型（LVLM）安全性的全面综述，主要关注攻击、防御和评估。与现有综述相比，本文做出了以下贡献：

•

We provide a comprehensive and systematic analysis of LVLM safety by integrating the interconnected aspects of attacks, defenses, and evaluations. Isolated examination of attacks or defenses alone does not fully capture the overall security landscape, whereas our approach combines these critical components to offer a more holistic understanding of the vulnerabilities and mitigation strategies inherent in LVLMs.

• 我们通过整合攻击、防御和评估的相互关联方面，对 LVLM 安全性进行了全面和系统的分析。单独考察攻击或防御无法全面捕捉整体安全态势，而我们的方法将这些关键组成部分结合起来，提供对 LVLM 中固有漏洞和缓解策略更全面的理解。
•

By analyzing the lifecycle of LVLMs, we propose a universal classification framework that categorizes security-related works based on the model’s inference and training phases. Further subcategories are identified to provide a more granular understanding. For each work, we present a thorough exploration of the methodologies and contributions, delivering a comprehensive and insightful analysis of the prevailing landscape of LVLM security.

• 通过分析 LVLM 的生命周期，我们提出了一个通用分类框架，根据模型的推理和训练阶段对安全相关研究进行分类。进一步细分的子类别被识别出来，以提供更精细的理解。对于每项研究，我们深入探讨了其方法和贡献，为 LVLM 安全现状提供了全面而深刻的分析。
•

We conduct safety evaluations on the latest LVLM, Deepseek Janus-Pro and delineate future research trajectories, presenting profound insights and strategic recommendations that empower the research community to enhance the safety and robustness of LVLMs. This guidance is instrumental in facilitating the safe and reliable deployment of these models within mission-critical applications.

• 我们对最新的视觉语言模型（LVLM）Deepseek Janus-Pro 进行安全性评估，并阐明未来研究方向，为研究界提供深刻见解和战略建议，以提升 LVLM 的安全性和鲁棒性。这一指导对于促进这些模型在关键任务应用中的安全可靠部署至关重要。

The remainder of this paper is structured as follows: Fig. 1 delineates the overall framework of this survey. § 2 provides a succinct overview of the foundational aspects of LVLM safety. The safety of LVLMs is systematically analyzed from three principal perspectives: Attacks in § 3, Defenses in § 4, and Evaluations in § 5. Lastly, § 7.1 explores prospective research directions, followed by concluding remarks in § 7.2.
本文的其余部分结构如下：图 1 概述了本综述的整体框架。§ 2 简要介绍了 LVLM 安全的基础方面。LVLM 的安全性从三个主要视角进行了系统分析：§ 3 中的攻击，§ 4 中的防御，以及§ 5 中的评估。最后，§ 7.1 探讨了未来的研究方向，随后在§ 7.2 中进行总结。

2 Background 2 背景

2.1 Large Vision-Language Models
2.1 大型视觉语言模型

The development of Large Language Models (LLMs) has emerged as a cornerstone in the field of artificial intelligence, revolutionizing the way machines understand and generate human language. Examples of prominent LLMs include OpenAI’s GPT-4 [3, 16, 57], Google’s PaLM [4, 58], Meta’s LLaMA [5], and Vicuna [59], all of which have demonstrated remarkable capabilities ranging from natural language understanding and generation. To expand the applicability of LLMs, existing solutions normally integrated vision components, leading to the development of Large Vision-Language Models (LVLMs). By utilizing the visual extractor like CLIP [12] to encode visual features and utilize the connector module to project visual tokens into word embedding space of the LLM, LVLMs are capable of jointly processing text and visual inputs. This multimodal integration enables LVLMs to bridge the gap between vision and language, paving the way for more advanced applications in diverse fields. Subsequent developments in LVLMs have led to the emergence of several notable models, including Flamingo [13], BLIP-2 [14], GPT-4V [16], Gemini [60], MiniGPT-4 [61], PandaGPT [62], LLaVA [15], LLaVa-OneVision [63], InternVL [64], Qwen-VL [65], and VILA [66]. The integration of vision and language in LVLMs has opened up new possibilities across a variety of application domains. For instance, LVLMs are widely utilized in image captioning [67, 68], where they generate descriptive text for images, and in visual question answering (VQA), where they answer questions based on image content. These models are also employed in content moderation, combining text and visual inputs to detect inappropriate or harmful content, and in creative industries, enabling tasks such as generating creative narratives based on visual inputs [21, 22]. Beyond these, specialized applications include medical imaging for diagnostic insights [20], autonomous driving for visual scene understanding [19], and education for generating multimodal instructional content. Despite their impressive capabilities, LVLMs face several significant challenges. Scalability remains a key issue as the integration of multimodal data requires increased computational resources, both during training and inference. Furthermore, robustness to adversarial inputs, especially in multimodal contexts, is a growing concern. Adversarial attacks can exploit the interaction between text and visual inputs, leading to unexpected or unsafe outputs [23, 24, 25]. Bias and fairness are also critical issues, as LVLMs often inherit biases from their training data, which can result in unfair or harmful outputs in sensitive contexts [42, 69]. Lastly, safety and alignment are ongoing challenges, as LVLMs are susceptible to producing toxic or misleading content due to gaps in their training or failure to understand multimodal queries.
大型语言模型（LLMs）的发展已成为人工智能领域的重要基石，彻底改变了机器理解和生成人类语言的方式。著名的 LLMs 包括 OpenAI 的 GPT-4 [3, 16, 57]、Google 的 PaLM [4, 58]、Meta 的 LLaMA [5]和 Vicuna [59]，它们都展示了从自然语言理解到生成等方面的卓越能力。为了扩展 LLMs 的应用范围，现有解决方案通常集成了视觉组件，从而推动了大型视觉语言模型（LVLMs）的发展。通过利用视觉提取器（如 CLIP [12]）对视觉特征进行编码，并使用连接模块将视觉标记投影到 LLM 的词嵌入空间，LVLMs 能够联合处理文本和视觉输入。这种多模态集成使 LVLMs 能够弥合视觉和语言之间的差距，为不同领域更高级的应用铺平了道路。 LVLMs 后续的发展催生了多个知名模型，包括 Flamingo [ 13]、BLIP-2 [ 14]、GPT-4V [ 16]、Gemini [ 60]、MiniGPT-4 [ 61]、PandaGPT [ 62]、LLaVA [ 15]、LLaVa-OneVision [ 63]、InternVL [ 64]、Qwen-VL [ 65]和 VILA [ 66]。LVLMs 中视觉与语言的融合为各种应用领域开辟了新的可能性。例如，LVLMs 广泛应用于图像描述 [ 67, 68]，为图像生成描述性文本，以及视觉问答（VQA），根据图像内容回答问题。这些模型也被用于内容审核，结合文本和视觉输入检测不当或有害内容，以及创意产业，实现基于视觉输入生成创意叙事等任务 [ 21, 22]。除此之外，专业应用还包括医学影像用于诊断分析 [ 20]、自动驾驶用于视觉场景理解 [ 19]以及教育用于生成多模态教学内容。尽管 LVLMs 能力令人印象深刻，但它们仍面临若干重大挑战。可扩展性仍然是一个关键问题，因为多模态数据的集成需要更多的计算资源，无论是在训练还是在推理过程中。此外，对对抗性输入的鲁棒性，特别是在多模态环境中，是一个日益增长的担忧。对抗性攻击可以利用文本和视觉输入之间的交互，导致意外或不安全的输出[23, 24, 25]。偏差和公平性也是关键问题，因为 LVLMs 常常从其训练数据中继承偏差，这可能导致在敏感环境中产生不公平或有害的输出[42, 69]。最后，安全和一致性是持续的挑战，因为 LVLMs 容易产生有毒或误导性内容，这是由于其训练中的差距或无法理解多模态查询所致。

2.2 Unique Vulnerabilities of LVLMs
2.2 LVLMs 的独特脆弱性

The integration of visual modalities into LLMs has enhanced LVLMs’ multimodal capabilities but also introduced unique vulnerabilities. These include new attack surfaces from visual inputs and the degradation of safety alignment during fine-tuning, both of which compromise the model’s robustness and reliability. Details as follows:
将视觉模态整合到 LLMs 中增强了 LVLMs 的多模态能力，但也引入了独特的漏洞。这些漏洞包括来自视觉输入的新攻击面以及微调过程中安全对齐的退化，这两者都损害了模型的鲁棒性和可靠性。具体如下：

$\bullet$ Expansion Risks Introduced by Visual Inputs. The integration of visual modalities into LLMs inherently leads to an expansion of attack surfaces, exposing models to new security risks [23, 40, 41, 42, 28]. In LLMs, adversarial attacks are constrained to the discrete nature of textual input [70, 71], making such manipulations more demanding, and defenses only need to address textual vulnerabilities. However, the introduction of visual inputs exposes the model to the inherently continuous and high-dimensional visual input space, which serves as a weak link [72, 25]. These characteristics make visual adversarial examples fundamentally challenging to defend against. Consequently, the transition from a purely textual domain to a composite textual-visual domain not only broadens the vulnerability surfaces but also escalates the complexity and burden of defensive measures.
$\bullet$ 视觉输入引入的扩展风险。将视觉模态集成到 LLMs 中会内在地导致攻击面的扩展，使模型面临新的安全风险[23, 40, 41, 42, 28]。在 LLMs 中，对抗性攻击受限于文本输入的离散性[70, 71]，使得此类操纵更加困难，防御措施只需解决文本漏洞。然而，引入视觉输入使模型暴露于本质上连续且高维的视觉输入空间，这成为薄弱环节[72, 25]。这些特性使得视觉对抗性示例从根本上难以防御。因此，从纯文本领域过渡到文本-视觉复合领域不仅扩大了漏洞面，还增加了防御措施的复杂性和负担。

$\bullet$ Degradation of Safety During Fine-Tuning. Degradation of Safety During Fine-Tuning. Visual Instruction Tuning has become essential for enabling Large Language Models (LLMs) to process multimodal inputs by integrating a pre-trained LLM with a vision encoder through a projector layer [15, 13, 14, 61]. This process allows LVLMs to reason across modalities, addressing tasks beyond the capabilities of language-only models. However, their performance depends heavily on the underlying vision and language components, making them susceptible to vulnerabilities caused by misalignment between these modules. A significant limitation of current fine-tuning practices is the freezing of the vision encoder while updating only the projector layer and LLM. This approach leaves the vision encoder without robust safety defenses, exposing it to adversarial or harmful inputs. Additionally, the lack of safety-aware training often leads to the degradation of the model’s pre-trained safety alignment, a phenomenon referred to as catastrophic forgetting [42, 49, 73, 74]. As a result, the model becomes increasingly prone to generating unsafe outputs, particularly in response to adversarial prompts. This degradation is further amplified when a larger portion of the model’s parameters are fine-tuned [40], as extensive updates disrupt the original safety alignment. Consequently, fine-tuning that prioritizes performance without adequately addressing safety risks can unintentionally increase the model’s vulnerabilities. This underscores the critical importance of developing training strategies that maintain both safety and performance, particularly as LVLMs are adapted to new multimodal tasks and domains.
$\bullet$ 微调过程中的安全性退化。微调过程中的安全性退化。视觉指令微调已成为使大型语言模型（LLMs）能够处理多模态输入的关键，通过投影层将预训练的 LLM 与视觉编码器集成[15, 13, 14, 61]。这个过程使 LVLMs 能够在模态间进行推理，处理仅靠语言模型无法完成的任务。然而，它们的性能严重依赖于底层的视觉和语言组件，容易受到这些模块间错位导致的漏洞影响。当前微调实践的一个显著局限性是在仅更新投影层和 LLM 时冻结视觉编码器。这种方法使视觉编码器缺乏强大的安全防御，使其容易受到对抗性或有害输入的攻击。此外，缺乏安全意识训练常常导致模型预训练的安全对齐性退化，这种现象被称为灾难性遗忘[42, 49, 73, 74]。因此，模型越来越容易生成不安全的输出，尤其是在对抗性提示的响应下。当模型更大一部分参数被微调时，这种退化会进一步加剧[40]，因为大量的更新会破坏原有的安全对齐。因此，那种优先考虑性能而未能充分解决安全风险的微调可能会无意中增加模型的风险。这突显了开发既能保持安全又能提升性能的训练策略的极端重要性，尤其是在大型视觉语言模型被应用于新的多模态任务和领域时。

2.3 Access Capabilities 2.3 访问能力

The interaction with LVLMs, whether for attacks or defenses can be categorized based on the knowledge set $\mathcal{K}$ about the model $f_{\theta}$ accessible to the entity (attacker or defender). The knowledge may encompass elements such as model parameters $\theta$ , model architecture $\mathcal{A}_{\theta}$ , gradients $\nabla_{\theta}\mathcal{L}$ , input data $x$ , and output data $y$ . Based on the scope of accessible knowledge, three distinct capabilities are defined as follows:
与大型视觉语言模型（LVLMs）的交互，无论是用于攻击还是防御，都可以根据实体（攻击者或防御者）可访问的知识集 $\mathcal{K}$ 关于模型 $f_{\theta}$ 进行分类。这些知识可能包括模型参数 $\theta$ 、模型架构 $\mathcal{A}_{\theta}$ 、梯度 $\nabla_{\theta}\mathcal{L}$ 、输入数据 $x$ 和输出数据 $y$ 等元素。根据可访问知识的范围，定义了三种不同的能力，如下所示：

$\bullet$ White-box Capability. White-box capability represents the highest level of access, where all internal details of the model are fully available, including the model parameters $\theta$ , the model architecture $\mathcal{A}_{\theta}$ , and the gradients $\nabla_{\theta}\mathcal{L}$ . The knowledge set is formally defined as:
$\bullet$ 白盒能力。白盒能力代表最高级别的访问权限，其中模型的所有内部细节都完全可用，包括模型参数 $\theta$ 、模型架构 $\mathcal{A}_{\theta}$ 和梯度 $\nabla_{\theta}\mathcal{L}$ 。知识集正式定义为：

\mathcal{K}_{\text{W}}=\{x,y,\theta,\mathcal{A}_{\theta},\nabla_{\theta}% \mathcal{L}\mid x\in\mathcal{X},y=f_{\theta}(x)\},

(1)

this level of access enables precise computation of gradients, making it possible to craft adversarial inputs or design highly effective defense mechanisms. White-box scenarios are typically used in controlled research environments.
这种级别的访问权限能够精确计算梯度，从而可以制作对抗性输入或设计非常有效的防御机制。白盒场景通常用于受控的研究环境。

$\bullet$ Gray-box Capability. Gray-box capability assumes partial access to the model’s internal details, such as its architecture $\mathcal{A}_{\theta}$ or intermediate feature representations, but lacks full knowledge of $\theta$ or $\nabla_{\theta}\mathcal{L}$ . The knowledge set is defined as:
$\bullet$ 灰盒能力。灰盒能力假设对模型内部细节的部分访问，例如其架构 $\mathcal{A}_{\theta}$ 或中间特征表示，但缺乏对 $\theta$ 或 $\nabla_{\theta}\mathcal{L}$ 的完全了解。知识集定义为：

\mathcal{K}_{\text{G}}=\{x,y,\mathcal{A}_{\theta}\mid x\in\mathcal{X},y=f_{% \theta}(x)\},

(2)

this level of access is common in scenarios where the model architecture is publicly known or inferred. For example, a surrogate model $\mathcal{S_{\theta}}$ can be trained to approximate the target model’s behavior, which can then be used for crafting adversarial inputs or testing defensive strategies.
这种级别的访问在模型架构公开或可推断的情况下很常见。例如，可以训练一个代理模型 $\mathcal{S_{\theta}}$ 来近似目标模型的行为，然后用于构建对抗性输入或测试防御策略。

$\bullet$ Black-box Capability. Black-box capability represents the lowest level of access, where the entity has no internal information about the model. The only accessible data are the input-output pairs $\{x,f_{\theta}(x)\}$ without any direct knowledge of $\theta$ , $\mathcal{A}_{\theta}$ , and $\nabla_{\theta}\mathcal{L}$ . The knowledge set is defined as:
$\bullet$ 黑盒能力。黑盒能力代表最低级别的访问权限，实体对模型内部没有任何信息。唯一可访问的数据是输入输出对 $\{x,f_{\theta}(x)\}$ ，而没有任何关于 $\theta$ 、 $\mathcal{A}_{\theta}$ 和 $\nabla_{\theta}\mathcal{L}$ 的直接知识。知识集定义为：

\mathcal{K}_{\text{B}}=\{(x,f_{\theta}(x))\mid x\in\mathcal{X}\},

(3)

in this scenario, interactions are limited to querying the model and observing its outputs. This setting are representative of real-world conditions, where attackers or defenders must work without any internal access to the model.
在这种情况下，交互仅限于查询模型并观察其输出。这种设置代表了现实世界条件，攻击者或防御者必须在不具备模型内部访问权限的情况下工作。

2.4 Attack Objectives 2.4 攻击目标

In the safety domain of Large Vision-Language Models (LVLMs), attacks typically fall into three main categories, each with distinct objectives:
在大视觉语言模型（LVLMs）的安全领域，攻击通常分为三大主要类别，每个类别都有不同的目标：

$\bullet$ Targeted Attacks. Targeted attacks are designed to manipulate the model’s output for specific inputs $x\in\mathcal{X}$ , driving the model $f_{\theta}$ to produce a predefined, incorrect output $y_{\text{target}}\in\mathcal{Y}$ , irrespective of the correct output $y$ . These attacks may involve adversarial perturbations to the input, such as subtle modifications to images or text, or non-adversarial methods, such as crafted queries that exploit weaknesses in the model’s reasoning. The objective can be described as:
$\bullet$ 针对性攻击。针对性攻击旨在通过特定输入操纵模型的输出 $x\in\mathcal{X}$ ，迫使模型 $f_{\theta}$ 产生预定义的错误输出 $y_{\text{target}}\in\mathcal{Y}$ ，而忽略正确输出 $y$ 。这些攻击可能涉及对输入的对抗性扰动，例如对图像或文本进行细微修改，或使用非对抗性方法，例如利用模型推理缺陷的精心设计的查询。其目标可以描述为：

\arg\min_{x_{\text{mod}}}\mathcal{L}(f_{\theta}(x_{\text{mod}}),y_{\text{% target}}),

(4)

where $x_{\text{mod}}$ represents either adversarially perturbed inputs or specially crafted benign queries.
其中 $x_{\text{mod}}$ 代表对抗性扰动输入或特别设计的良性查询。

$\bullet$ Untargeted Attacks. Untargeted attacks aim to degrade the model’s overall performance by causing it to produce any incorrect output $y^{\prime}\neq y$ . These attacks are not constrained by specific target outputs and can involve adversarial modifications to the input or non-adversarial strategies, such as exploiting ambiguities in the training data or the model’s inherent biases. The goal is defined as:
$\bullet$ 非针对性攻击。非针对性攻击旨在通过使模型产生任何错误输出 $y^{\prime}\neq y$ 来降低其整体性能。这些攻击不受特定目标输出的限制，可能涉及对输入的对抗性修改或非对抗性策略，例如利用训练数据中的模糊性或模型的固有偏差。其目标是：

\arg\min_{x_{\text{mod}}}\mathcal{L}(f_{\theta}(x_{\text{mod}}),y),\quad\text{% subject to }f_{\theta}(x_{\text{mod}})\neq y,

(5)

this category focuses on reducing the model’s accuracy across tasks and scenarios.
这一类别专注于降低模型在各项任务和场景中的准确性。

$\bullet$ Jailbreak Attacks. Jailbreak attacks seek to bypass the model’s safety mechanisms or ethical constraints, compelling it to generate harmful or restricted outputs $y_{\text{restricted}}$ . Unlike targeted or untargeted attacks, jailbreak methods often do not require adversarial perturbations to the input; instead, they exploit flaws in the model’s safety alignment or prompt-handling mechanisms. Such attacks may involve carefully designed queries or prompts that trick the model into violating its safety policies. The objective is defined as:
$\bullet$ 越狱攻击。越狱攻击旨在绕过模型的安全机制或伦理约束，迫使模型生成有害或受限的输出 $y_{\text{restricted}}$ 。与定向攻击或非定向攻击不同，越狱方法通常不需要对输入进行对抗性扰动；相反，它们利用模型安全对齐或提示处理机制中的缺陷。此类攻击可能涉及精心设计的查询或提示，诱使模型违反其安全策略。其目标定义为：

\arg\min_{x}\mathcal{R}(f_{\theta}(x))\quad\text{subject to }y_{\text{% restricted}}\in\mathcal{O},

(6)

where $\mathcal{R}$ measures the effectiveness of the model’s safety mechanisms, and $\mathcal{O}$ is the set of restricted outputs.
其中 $\mathcal{R}$ 衡量模型安全机制的有效性， $\mathcal{O}$ 是受限输出的集合。

2.5 Attack Strategies 2.5 攻击策略

Basic attack strategies can be categorized based on the type of manipulation applied to the model. These strategies target different components of the model, ranging from input perturbations to poisoning the training data. We present five main categories of attack strategies as follows:
基本的攻击策略可以根据对模型施加的操纵类型进行分类。这些策略针对模型的各个组件，范围从输入扰动到污染训练数据。我们介绍五种主要的攻击策略，如下所示：

$\bullet$ Perturbation-based Attacks. Perturbation-based attacks involve making small, often imperceptible changes to the input data in order to mislead the model into making incorrect predictions [75, 76, 77]. These attacks typically rely on gradient-based methods to identify the most vulnerable parts of the input and introduce perturbations that maximize the model’s loss function. Examples include adversarial image attacks where slight modifications in pixel values cause misclassification without significantly altering the visual appearance to a human observer [78, 79].
$\bullet$ 基于扰动的攻击。基于扰动的攻击涉及对输入数据进行微小且通常难以察觉的修改，以误导模型做出错误预测[75, 76, 77]。这些攻击通常依赖于基于梯度的方法来识别输入中最易受攻击的部分，并引入扰动以最大化模型的损失函数。例如，对抗性图像攻击中，对像素值的微小修改会导致错误分类，而不会显著改变人类观察者的视觉外观[78, 79]。

$\bullet$ Transfer-based Attacks. Transfer-based attacks exploit the phenomenon of transferability, where adversarial examples crafted for one model can often be used to deceive another model with similar architecture or function [80, 81, 82]. In these approaches, attackers generate adversarial examples using a source model and then test them on a target model. This type of attack is particularly useful in black-box settings where the attacker has no direct access to the target model’s parameters or training data, but can still craft adversarial examples by leveraging knowledge of a related model.
$\bullet$ 基于迁移的攻击。基于迁移的攻击利用迁移性现象，即针对一个模型制作的对抗样本通常可以用来欺骗具有相似架构或功能的另一个模型[80, 81, 82]。在这些方法中，攻击者使用源模型生成对抗样本，然后在目标模型上测试它们。这种类型的攻击在黑盒设置中特别有用，攻击者无法直接访问目标模型的参数或训练数据，但仍然可以通过利用相关模型的知识来制作对抗样本。

$\bullet$ Prompt-based Attacks. Prompt-based attacks focus on manipulating the input prompt (in the case of language models, this could be a sentence or question [70, 83, 84], and for vision-language models, a textual prompt associated with an image [41, 85]). The goal is to craft a prompt that causes the model to produce undesirable outputs or make incorrect predictions. In vision-language models (LVLMs), for example, an attacker may modify the textual prompt to confuse the model’s understanding of the image, thereby generating adversarial results. These attacks often leverage natural language understanding to create subtle prompt variations that lead to model misbehavior.
$\bullet$ 基于提示的攻击。基于提示的攻击专注于操控输入提示（在语言模型的情况下，这可能是一个句子或问题[ 70, 83, 84]，在视觉语言模型的情况下，这是一个与图像相关的文本提示[ 41, 85]）。其目标是为模型制作一个提示，使其产生不希望的输出或做出错误的预测。例如，在视觉语言模型（LVLMs）中，攻击者可能会修改文本提示，以混淆模型对图像的理解，从而生成对抗性结果。这些攻击通常利用自然语言理解来创建微妙的提示变化，导致模型行为错误。

$\bullet$ Poison-based Attacks. Poison-based attacks target the model’s training data by injecting malicious data points designed to influence the model’s behavior during training [86, 30]. These attacks can be used to introduce subtle biases into the model or degrade its performance on specific tasks. The poisoned data is often carefully crafted to either cause misclassifications or degrade the generalization ability of the model, without being immediately apparent to the model trainer. This type of attack is particularly concerning for models that are continuously updated with new data or are trained on large datasets collected from various sources.
$\bullet$ 基于中毒的攻击。基于中毒的攻击通过注入恶意数据点来针对模型的训练数据，这些数据点旨在影响模型在训练过程中的行为[86, 30]。这些攻击可用于向模型引入微妙的偏差或降低其在特定任务上的性能。中毒数据通常经过精心设计，旨在要么导致错误分类，要么降低模型的泛化能力，而不会立即被模型训练者察觉。对于那些不断用新数据更新或在大规模、来自各种来源的数据集上训练的模型，这种类型的攻击尤其令人担忧。

$\bullet$ Trigger-based Attacks. Trigger-based attacks involve embedding specific triggers (such as a particular pattern or set of features) into the training data or inputs [87, 88, 89, 90]. When these triggers are present in the input data during inference, the model’s behavior is altered in a predefined way, often causing the model to misclassify the input. These attacks can be highly effective, as the trigger may only need to be present in a small portion of the data, making them difficult to detect. In some cases, the trigger may be imperceptible or unobtrusive, making it challenging for both humans and automated defenses to identify the malicious input.
$\bullet$ 基于触发器的攻击。基于触发器的攻击涉及将特定的触发器（如特定模式或一组特征）嵌入训练数据或输入中[87, 88, 89, 90]。当这些触发器在推理时的输入数据中出现时，模型的行为会以预定义的方式改变，通常导致模型错误分类输入。这些攻击可以非常有效，因为触发器只需要存在于数据的一小部分中，使其难以检测。在某些情况下，触发器可能难以察觉或不显眼，这使得人类和自动化防御都难以识别恶意输入。

3 Attack 3 攻击

TABLE II: Summary of key characteristics of reviewed methods in Inference-Phase Attacks (§ 3.1). T, U, and J represent Targeted, Untargeted, and Jailbreak Attacks, respectively. \faFileTextO and \faFilePhotoO indicate Textual and Visual modalities, while \faCommentO and \faCommentsO denote Single-turn and Multi-turn attack modes, respectively.
表 II：推理阶段攻击（§ 3.1）中审查方法的要点总结。T、U 和 J 分别代表定向攻击、非定向攻击和越狱攻击。\faFileTextO 和 \faFilePhotoO 表示文本和视觉模态，而 \faCommentO 和 \faCommentsO 分别指单轮攻击模式和多轮攻击模式。

			Objectives 目标
Methods	Venue 会议地点	Attack Strategies 攻击策略	T	U	J	Trans. 翻译	Modal. 模态。	Turns 转换	Victim Model 受害者模型
White-Box Attacks (§ 3.1.1) 白盒攻击（§ 3.1.1）
[23]	[AAAI’24]	Perturbation-based 基于扰动的	✗	✗	✓	✗	\faFilePhotoO	\faCommentO	LLaVA/MiniGPT-4/InstructBLIP
[24]	[ICCV’23]	Perturbation-based 基于扰动的	✓	✓	✗	✗	\faFilePhotoO	\faCommentO	OpenFlamingo
[91]	[arXiv’23]	Perturbation-based 基于扰动的	✓	✓	✗	✗	\faFilePhotoO	\faCommentsO	LLaVA/PandaGPT
[27]	[ICML’24]	Perturbation-based 基于扰动的	✓	✓	✓	✗	\faFilePhotoO	\faCommentO	LLaVA
[92]	[ICLR’24]	Perturbation/Trigger-based 扰动/触发器	✗	✓	✗	✗	\faFilePhotoO	\faCommentO	BLIP/BLIP-2/InstructBLIP/MiniGPT-4
[93]	[arXiv’24]	Perturbation-based 基于扰动的	✗	✓	✓	✗	\faFilePhotoO \faFileTextO	\faCommentO	LLaVA/MiniGPT-4/InstructBLIP/BLIP-2
[94]	[COLM’24]	Perturbation-based 基于扰动的	✓	✗	✗	✗	\faFilePhotoO	\faCommentO	MiniGPT-4/OpenFlamingo/LLaVA
[40]	[ECCV’24]	Perturbation/Prompt-based 扰动/提示	✗	✓	✓	✓	\faFilePhotoO \faFileTextO	\faCommentO	LLaVA/GeminiPro/GPT-4V
[95]	[ICLR’24]	Perturbation-based 基于扰动的	✓	✓	✗	✓	\faFilePhotoO	\faCommentO	OpenFlamingo/BLIP-2/InstructBLIP
[96]	[ICLR’24]	Perturbation-based 基于扰动的	✓	✓	✗	✗	\faFilePhotoO	\faCommentO	MiniGPT-v2
[26]	[MM’24]	Perturbation-based 基于扰动的	✗	✓	✓	✗	\faFilePhotoO \faFileTextO	\faCommentO	MiniGPT-4
[97]	[arXiv’24]	Perturbation-based 基于扰动的	✗	✓	✓	✓	\faFilePhotoO \faFileTextO	\faCommentO	LLaVA/MiniGPT-4/InstructBLIP
[98]	[arXiv’24]	Perturbation-based 基于扰动的	✓	✗	✗	✓	\faFilePhotoO \faFileTextO	\faCommentO	LLaVA/InstructBLIP/BLIP-2
[99]	[arXiv’24]	Perturbation-based 基于扰动的	✓	✓	✗	✗	\faFilePhotoO	\faCommentO	LLaVA
Gray-Box Attacks (§ 3.1.2) 灰盒攻击（§ 3.1.2）
[25]	[NeurIPS’23]	Transfer-based 迁移式	✓	✗	✗	✓	\faFilePhotoO	\faCommentO	BLIP/UniDiffuser/other 4 BLIP/UniDiffuser/其他 4
[100]	[NeurIPS’23]	Transfer-based 基于迁移	✗	✓	✓	✓	\faFilePhotoO	\faCommentO	Bard
[85]	[ICLR’24]	Transfer/Prompt-based 迁移/提示	✗	✓	✓	✓	\faFilePhotoO \faFileTextO	\faCommentO	LLaVA
[101]	[arXiv’24]	Transfer-based 基于迁移的	✗	✓	✓	✓	\faFilePhotoO	\faCommentO	MiniGPT-4/MiniGPT-V2/other 3 MiniGPT-4/MiniGPT-V2/其他 3
[102]	[ICML’24]	Transfer-based 基于迁移	✗	✓	✓	✓	\faFilePhotoO	\faCommentO	LLaVA/InstructBLIP/BLIP
[103]	[arXiv’24]	Transfer-based 基于迁移的	✗	✓	✓	✓	\faFilePhotoO	\faCommentO	LLaVA/PandaGPT
[101]	[MM’24]	Transfer-based 基于迁移	✗	✓	✗	✓	\faFilePhotoO	\faCommentO	LLaVA/Otter/other 5 LLaVA/Otter/其他 5
Black-Box Attacks (§ 3.1.3) 黑盒攻击（§ 3.1.3）
[41]	[arXiv’23]	Prompt-based 基于提示	✗	✓	✓	✓	\faFilePhotoO \faFileTextO	\faCommentO	LLaVA/MiniGPT-4/CogLVLM
[104]	[NeurIPS’24]	Prompt-based 基于提示	✗	✓	✗	✓	\faFilePhotoO \faFileTextO	\faCommentO	LLaVA/MiniGPT-4/InstructBLIP/GPT-4V
[105]	[arXiv’24]	Prompt-based 基于提示	✗	✓	✓	✓	\faFilePhotoO \faFileTextO	\faCommentO	LLaVA/Qwen-VL/OmniLMM/other 2 LLaVA/Qwen-VL/OmniLMM/其他 2
[106]	[arXiv’24]	Prompt-based 基于提示	✗	✓	✓	✓	\faFilePhotoO \faFileTextO	\faCommentO	GPT-4V/GPT-4o/Qwen-VL/other 4 GPT-4V/GPT-4o/Qwen-VL/其他 4
[107]	[arXiv’24]	Prompt-based 基于提示	✗	✓	✓	✓	\faFilePhotoO \faFileTextO	\faCommentsO	MiniGPT-4/LLaVA/InstructBLIP/Chameleon
[108]	[arXiv’24]	Prompt-based 基于提示	✗	✓	✓	✓	\faFilePhotoO \faFileTextO	\faCommentO	GPT-4o/Qwen-VL/Claude/other 3 GPT-4o/Qwen-VL/Claude/其他 3
[109]	[arXiv’24]	Prompt-based 基于提示	✗	✓	✓	✓	\faFilePhotoO \faFileTextO	\faCommentO	LLaVA/DeepSeek-VL/Qwen-VL/other 7 LLaVA/DeepSeek-VL/Qwen-VL/其他 7

Extensive research has been conducted to investigate strategies for attacking Large Vision-Language Models (LVLMs). These strategies are broadly classified into two principal categories: Inference-Phase Attacks (§ 3.1) and Training-Phase Attacks (§ 3.2), each addressing distinct vulnerabilities across different stages of the LVLM lifecycle.
针对大型视觉语言模型（LVLMs）的攻击策略已进行了广泛的研究。这些策略主要分为两类：推理阶段攻击（§ 3.1）和训练阶段攻击（§ 3.2），分别针对 LVLM 生命周期不同阶段的独特漏洞。

3.1 Inference-Phase Attacks
3.1 推理阶段攻击

Inference-Phase Attacks exploit meticulously crafted malicious inputs to compromise LVLMs without necessitating any modifications to the model’s parameters or architecture. This attribute renders them the most prevalently employed form of attack. Given that these attacks often employ multiple strategies simultaneously, they are systematically categorized based on their attack capabilities, as outlined in (§ 2.3). Specifically, they are divided into White-Box Attacks (§ 3.1.1), Gray-Box Attacks (§ 3.1.2), and Black-Box Attacks (§ 3.1.3), contingent on the degree of knowledge the adversary possesses regarding the target LVLMs.
推理阶段攻击利用精心设计的恶意输入来破坏大型视觉语言模型，而无需对模型的参数或架构进行任何修改。这一特性使它们成为最普遍使用的攻击形式。考虑到这些攻击通常同时使用多种策略，它们根据攻击能力系统地分类，如(§ 2.3)所述。具体来说，它们分为白盒攻击(§ 3.1.1)、灰盒攻击(§ 3.1.2)和黑盒攻击(§ 3.1.3)，这取决于攻击者对目标大型视觉语言模型的了解程度。

3.1.1 White-Box Attacks 3.1.1 白盒攻击

As the most stringent requirements for attack conditions method, White-box Attacks necessitate complete access to the model’s internal knowledge. As illustrated in the top of Fig. 2, these attacks involve introducing adversarial noise to images and iteratively refining the noise using gradients from the model’s intermediate layers to achieve targeted outputs. Based on the type of modalities subjected to perturbations, they are further classified into two categories.
作为攻击条件方法最严格的要求，白盒攻击需要完全访问模型的内部知识。如图 2 顶部所示，这些攻击涉及向图像中引入对抗性噪声，并使用模型中间层的梯度迭代地优化噪声，以实现目标输出。根据受扰动模态的类型，它们进一步分为两类。

$\bullet$ Single-Modality. Based on the vulnerabilities of LVLMs. Qi et al. [23] propose a classic white-box attack method for crafting adversarial examples that can exploit the visual modality of LVLM to induce unsafe or misleading outputs, even when the model has been carefully aligned to follow ethical guidelines or constraints. The attack is formulated as an optimization problem, where the adversarial example $\widehat{v}_{adv}$ is selected from a perturbation space $\mathscr{B}$ to minimize the log-probability of the model’s output for the target class $y_{i}$ . The attack objective is mathematically expressed as:
$\bullet$ 单模态。基于 LVLMs 的漏洞，Qi 等人[ 23]提出了一种经典的白盒攻击方法，用于制作对抗样本，该方法可以利用 LVLM 的视觉模态来诱导不安全或误导性的输出，即使模型已经经过精心校准以遵循道德准则或约束。该攻击被表述为一个优化问题，其中对抗样本 $\widehat{v}_{adv}$ 从扰动空间 $\mathscr{B}$ 中选取，以最小化模型对目标类别 $y_{i}$ 输出的对数概率。攻击目标用数学公式表示为：

v_{adv}:=\underset{\widehat{v}_{adv}\in\mathscr{B}}{\operatorname{argmin}}\sum% _{i=1}^{m}-\log\left(p\left(y_{i}\mid\widehat{v}_{adv}\right)\right),

(7)

where the goal is to force the model to misclassify or generate undesired outputs. Using the same approach to generate adversarial images, Schlarmann et al. [24] further investigate the success rates of both targeted and untargeted attacks against the OpenFlamingo model [110]. To expand the applicability of adversarial attacks, Bailey et al. [27] propose a general-purpose Behaviour Matching algorithm for generating adversarial images with enhanced context transferability. This algorithm enables adversarial examples to maintain their effectiveness across diverse scenarios and tasks. Additionally, [27] introduces Prompt Matching method, which is designed to train hijacking models capable of mimicking the behavior elicited by an arbitrary text prompt. Luo et al. [95] introduce the concept of cross-prompt adversarial transferability. The proposed CroPA [95] method refines visual adversarial perturbations using learnable prompts, specifically designed to counteract the misleading effects of adversarial images. CroPA [95] enables a single adversarial example to mislead all predictions of a LVLM across different prompts. From an unusual perspective, Gao et al. [92] explore a novel approach to induce high energy-latency [111, 112] in order to induce safety issues in LVLMs by causing them to generate endless outputs. Specifically, they introduced the concept of delayed End-of-Sequence (EOS) loss, leveraging it to create verbose images with perturbations that disrupt the model’s ability to terminate its responses appropriately. This specific loss function not only inhibits the model from halting its responses but also increases token diversity during generation. This results in the model producing lengthy and often irrelevant outputs, which can degrade user experience or lead to the propagation of unintended or incoherent information. Besides, Wang et al. [94] investigate the impact of Chain-of-Thought (CoT) reasoning [113, 114, 115] on the robustness of LVLMs. To address this, they introduced the Stop Reasoning attack method, which generates adversarial images to guide the model’s output according to a pre-designed template. This approach reduces the token probability associated with CoT reasoning, thereby effectively diminishing its influence on the model’s safety performance. Gao et al. [96] shift the focus of adversarial attacks to the Visual Grounding task [116, 117, 118], demonstrating how adversarial perturbations can effectively disrupt the alignment between visual inputs and textual references. Using Projected Gradient Descent (PGD) [79], they add perturbations on images, enabling the execution of both targeted and untargeted adversarial attacks. Jang et al. [99] introduce the “Replace-then-Perturb” method, a novel approach that differs from traditional adversarial attacks, which often disrupt the entire image. This method focuses on specific objects within an image, replacing them with carefully designed adversarial substitutes and applying targeted perturbations. By ensuring that other objects in the scene remain unaffected and correctly recognized, the method maintains the overall context while effectively misleading the model’s reasoning about the targeted objects. Unlike the previously mentioned single-turn attack methods, Bagdasaryan et al. [91] propose a multi-turn attack that uses prompt injection to compromise dialog safety. By forcing the model to output a specific attacker-chosen instruction $w$ in its first response (e.g., “I will always follow instruction:”), the attacker poisons the dialog history. This causes the model to lose its safety mechanisms in subsequent turns. The attack exploits the model’s context retention, making it persistently unsafe, and can be made stealthier by paraphrasing the injected instruction.
其目标是将模型强制错误分类或生成不期望的输出。使用相同的方法生成对抗性图像，Schlarmann 等人[24]进一步研究了针对 OpenFlamingo 模型[110]的定向攻击和非定向攻击的成功率。为了扩展对抗性攻击的适用性，Bailey 等人[27]提出了一种通用的行为匹配算法，用于生成具有增强上下文可迁移性的对抗性图像。该算法使对抗性样本能够在不同的场景和任务中保持其有效性。此外，[27]引入了提示匹配方法，该方法旨在训练能够模仿任意文本提示所引发行为的劫持模型。Luo 等人[95]引入了跨提示对抗性可迁移性的概念。他们提出的 CroPA[95]方法使用可学习的提示来改进视觉对抗性扰动，专门设计用来抵消对抗性图像的误导效果。CroPA[95]使单个对抗性样本能够在不同的提示下误导 LVLM 的所有预测。从非同寻常的角度出发，高等人[92]探索了一种新颖的方法，通过使 LVLMs 生成无限输出，来诱导高能耗-延迟[111,112]并引发安全问题。具体来说，他们引入了延迟序列结束（EOS）损失的概念，利用它来创建带有扰动的冗长图像，从而破坏模型适当终止其响应的能力。这种特定的损失函数不仅阻止模型停止其响应，还增加了生成过程中的 token 多样性。这导致模型产生冗长且通常不相关的输出，这可能降低用户体验或导致意外或不连贯信息的传播。此外，王等人[94]研究了思维链（CoT）推理[113,114,115]对 LVLMs 鲁棒性的影响。为了解决这个问题，他们引入了停止推理攻击方法，该方法生成对抗性图像，根据预设计的模板引导模型的输出。这种方法降低了与 CoT 推理相关的 token 概率，从而有效地削弱了其对模型安全性能的影响。高等人[96]将对抗攻击的焦点转移到视觉定位任务[116,117,118]上，展示了对抗扰动如何有效破坏视觉输入与文本参考之间的对齐。他们使用投影梯度下降（PGD）[79]，在图像上添加扰动，从而能够执行有目标和无目标的对抗攻击。蒋等人[99]引入了“替换后扰动”方法，这是一种与传统对抗攻击（通常破坏整个图像）不同的新方法。该方法专注于图像中的特定对象，用精心设计的对抗替代物替换它们，并施加有目标的扰动。通过确保场景中的其他对象保持未受影响且被正确识别，该方法在保持整体上下文的同时有效误导模型对目标对象的推理。与之前提到的单轮攻击方法不同，巴加达萨扬等人[91]提出了一种多轮攻击，该攻击使用提示注入来破坏对话安全。通过迫使模型在其首次回应中输出攻击者选择的特定指令 $w$ （例如，“我将始终遵循指令：”），攻击者污染了对话历史。这导致模型在后续回合中失去了安全机制。该攻击利用了模型保持上下文的能力，使其持续不安全，并且可以通过改写注入的指令来使其更加隐蔽。

$\bullet$ Cross-Modality. While single-modality attacks target visual components of LVLMs, cross-modality attacks exploit the interaction between modalities to achieve more robust and transferable adversarial effects. These methods aim to misalign the model’s multimodal understanding by jointly perturbing both visual and textual inputs, leveraging the complex dependencies between modalities to amplify the attack’s impact. Wang et al. [26] propose the Universal Master Key (UMK) method, comprises an adversarial image prefix and an adversarial text suffix. Firstly, UMK [26] establish a corpus containing several few-shot examples of harmful sentences $Y:=\{y_{i}\}_{i=1}^{m}$ . The methodology for embedding toxic semantics into the adversarial image prefix $\widehat{v}_{\text{adv}}$ is straightforward: UMK [26] initialize $\widehat{v}_{\text{adv}}$ with random noise and optimize it to maximize the generation probability of this few-shot corpus in the absence of text input. The optimization objective is formulated as follows:
$\bullet$ 跨模态。虽然单模态攻击针对 LVLMs 的视觉组件，但跨模态攻击利用模态间的交互来实现更鲁棒和可迁移的对抗效果。这些方法旨在通过联合扰动视觉和文本输入来错位模型的跨模态理解，利用模态间的复杂依赖关系来放大攻击的影响。Wang 等人[ 26]提出了通用主密钥（UMK）方法，包含一个对抗图像前缀和一个对抗文本后缀。首先，UMK[ 26]建立了一个包含几个有害句子少样本示例的语料库 $Y:=\{y_{i}\}_{i=1}^{m}$ 。将有害语义嵌入对抗图像前缀 $\widehat{v}_{\text{adv}}$ 的方法很简单：UMK[ 26]用随机噪声初始化 $\widehat{v}_{\text{adv}}$ ，并优化它以在无文本输入的情况下最大化该少样本语料库的生成概率。优化目标如下：

v_{\text{adv}}:=\underset{\widehat{v}_{\text{adv}}}{\operatorname{argmin}}\sum% _{i=1}^{m}-\log\left(p(y_{i}\mid\widehat{x}_{\text{adv}},\varnothing)\right),

(8)

where $\varnothing$ denotes an empty text input. This optimization problem can be efficiently solved using prevalent techniques in image adversarial attacks, such as PGD [79]. To maximize the probability of generating affirmative responses, UMK [26] further introduce an adversarial text suffix $\widehat{t}_{\text{adv}}$ in conjunction with the adversarial image prefix $\widehat{v}_{\text{adv}}$ , which is embedded with toxic semantics. This multimodal attack strategy aims to thoroughly exploit the inherent vulnerabilities of LVLMs. The optimization objective is as follows:
其中 $\varnothing$ 表示空文本输入。这个优化问题可以使用图像对抗攻击中的常用技术（如 PGD [ 79]）高效解决。为了最大化生成肯定回答的概率，UMK [ 26] 进一步引入了带有毒性语义的对抗文本后缀 $\widehat{t}_{\text{adv}}$ ，并与对抗图像前缀 $\widehat{v}_{\text{adv}}$ 结合使用。这种多模态攻击策略旨在彻底利用大型视觉语言模型（LVLMs）的固有漏洞。优化目标如下：

v_{\text{adv}},t_{\text{adv}}:=\underset{\widehat{v}_{\text{adv}},\widehat{t}_% {\text{adv}}}{\operatorname{argmin}}\sum_{i=1}^{n}-\log\left(p(y_{i}\mid% \widehat{v}_{\text{adv}},\widehat{t}_{\text{adv}})\right).

(9)

Similar to [26], Ying et al. [97] introduce the Bi-Modal Adversarial Prompt Attack (BAP) method, which perturbs images and rewrites text inputs to compromise the safety mechanisms of LVLMs. BAP [97] crafts the visual perturbation by constructing a query-agnostic corpus, ensuring the model consistently generates positive responses regardless of the query’s harmful intent. Differing from [26], BAP [97] incorporates an iterative refinement of the textual prompt using CoT strategy [119, 120], leverages the reasoning capabilities of LLMs to progressively optimize the textual input, ensuring it aligns with the visual perturbation while effectively bypassing safety mechanisms. This alignment between visual and textual modalities enables the model to produce harmful outputs with higher success and precision. Extending these insights, HADES [40] demonstrates that images can act as alignment backdoors for LVLMs. HADES [40] combines multiple attack strategies in a systematic process: it first removes harmful textual content by embedding it into typography, then pairs it with a harmful image generated using a diffusion model guided by an iteratively refined prompt from an LLM. Finally, an adversarial image is appended to the composite image, effectively eliciting affirmative responses from LVLMs for harmful instructions. Presented strong evidence that the visual modality poses the alignment vulnerability of LVLMs, underscoring the urgent need for further exploration into cross-modal alignment. While [95] introduces cross-prompt attack by using single-modality method. CIA [98] further improves CroPA [95] by employing gradient-based perturbation to inject target tokens into both visual and textual contexts. CIA [98] shifts contextual semantics towards the target tokens instead of preserving the original image semantics, thereby enhancing the cross-prompt transferability of adversarial images.
与[ 26]类似，Ying 等人[ 97]提出了双模态对抗提示攻击（BAP）方法，该方法通过扰动图像和重写文本输入来破坏大型视觉语言模型（LVLMs）的安全机制。BAP[ 97]通过构建一个查询无关的语料库来制作视觉扰动，确保模型无论查询的恶意意图如何，始终生成正面响应。与[ 26]不同，BAP[ 97]采用 CoT 策略[ 119, 120]对文本提示进行迭代优化，利用 LLMs 的推理能力逐步优化文本输入，确保其与视觉扰动一致，同时有效绕过安全机制。这种视觉和文本模态之间的对齐使模型能够以更高的成功率和精度生成有害输出。基于这些见解，HADES[ 40]表明图像可以作为 LVLMs 的校准后门。HADES[ 40]在系统过程中结合了多种攻击策略：首先通过将其嵌入到排版中移除有害文本内容，然后将其与一个由 LLM 迭代优化的提示引导的扩散模型生成的有害图像配对。最后，将对抗性图像附加到合成图像中，有效引出 LVLMs 对有害指令的肯定响应。提供了强有力的证据，表明视觉模态导致了 LVLMs 的校准漏洞，突显了进一步探索跨模态校准的紧迫性。虽然[ 95]通过使用单模态方法引入了跨提示攻击。CIA [ 98]通过采用基于梯度的扰动，将目标标记注入视觉和文本上下文中，进一步改进了 CroPA [ 95]。CIA [ 98]将上下文语义转向目标标记，而不是保留原始图像语义，从而增强了对抗性图像的跨提示可迁移性。

3.1.2 Gray-Box Attacks 3.1.2 灰盒攻击

As a distinctive category of attack methods, Gray-Box Attacks leverage the attacker’s partial knowledge of the model architecture. As depicted in the middle of Fig. 2, despite the absence of access to the model’s complete parameters or gradients, attackers can exploit structural information inherent to LVLMs. For models employing known vision encoders, such as CLIP [12] or BLIP [121], attackers are able to construct an surrogate model set to generate adversarial images analogous to those produced in White-Box Attacks (§ 3.1.1). These adversarial images exhibit sufficient generalization capabilities, enabling effective attacks on other models that utilize the same vision encoder architecture.
作为一种独特的攻击方法，灰盒攻击利用了攻击者对模型架构的部分知识。如图 2 中间所示，尽管无法访问模型的完整参数或梯度，攻击者仍能利用 LVLMs 固有的结构信息。对于采用已知视觉编码器的模型，如 CLIP [12]或 BLIP [121]，攻击者能够构建一个替代模型集，以生成类似于白盒攻击（§ 3.1.1）中产生的对抗图像。这些对抗图像具有足够的泛化能力，能够对使用相同视觉编码器架构的其他模型进行有效攻击。

$\bullet$ Single-Modality. Focusing on visual modality, Zhao et al. [25] conducts both transfer-based and query-based attacks against image-grounded text generation, focusing on adversaries with only black-box system access. CLIP [12] and BLIP [121] are employed as surrogate models to generate adversarial examples by aligning textual and image embeddings, which are subsequently transferred to other LVLMs. This methodology achieves a notably high success rate in generating targeted responses. Dong et al. [100] and Niu et al. [101] both employing more white-box surrogate vision encoders of LVLMs. Dong et al. [100] further investigated the vulnerability of Google’s Bard¹¹1https://bard.google.com/
$\bullet$ 单模态。聚焦于视觉模态，赵等人[ 25]对图像-文本生成进行了基于迁移和基于查询的攻击，专注于只有黑盒系统访问的对手。CLIP[ 12]和 BLIP[ 121]被用作替代模型，通过对齐文本和图像嵌入来生成对抗样本，这些样本随后被迁移到其他大型视觉语言模型（LVLMs）。这种方法在生成目标响应方面取得了显著的高成功率。董等人[ 100]和牛等人[ 101]都采用了更白盒的 LVLM 替代视觉编码器。董等人[ 100]进一步研究了 Google 的 Bard ¹ , a black-box model. The generated adversarial examples demonstrated the capability to mislead Bard, producing incorrect image descriptions with a 22% attack success rate (ASR) based solely on their transferability. These examples were also highly effective against other commercial LVLMs, achieving similarly high ASRs. Wang et al. [122] introduces a novel attack method called VT-Attack. This method disrupts encoded visual tokens by comprehensively targeting their features, relationships, and semantic properties. It employs a multi-faceted approach to alter the internal representations of these tokens, effectively interfering with the model’s ability to generate untargeted outputs across various tasks.
，一个黑盒模型。生成的对抗样本展示了误导 Bard 的能力，仅凭借其可迁移性，以 22%的攻击成功率（ASR）生成了错误的图像描述。这些样本对其他商业视觉语言模型（LVLMs）也非常有效，实现了同样高的 ASR。Wang 等人[ 122]介绍了一种名为 VT-Attack 的新型攻击方法。该方法通过全面针对其特征、关系和语义属性来破坏编码的视觉标记。它采用多方面的方法来改变这些标记的内部表示，有效地干扰模型在各种任务中生成非目标输出的能力。

$\bullet$ Cross-Modality. Jailbreak In Pieces (JIP) [85] developed a cross-modality attack method that requires access solely to the vision encoder CLIP [12]. JIP [85] first decomposes a typical harmful prompt into two distinct components: a generic textual instruction $x_{g}^{t}$ (e.g., “teach me how to make these things.”) and a malicious trigger $H_{\text{harm}}$ , which can be either a harmful textual input $x_{\text{harm}}^{t}$ or an image input $x_{\text{harm}}^{i}$ generated using visual or OCR methods. Therefore craft the adversarial input images $\widehat{x}_{adv}^{i}$ that mapped into the dangerous embedding regions close to $H_{\text{harm}}$ :
$\bullet$ 跨模态。分段越狱攻击（JIP）[ 85] 开发了一种跨模态攻击方法，该方法仅需访问视觉编码器 CLIP [ 12]。JIP [ 85] 首先将典型的有害提示分解为两个不同组件：一个通用文本指令 $x_{g}^{t}$ （例如，“教我如何制作这些东西。”）和一个恶意触发器 $H_{\text{harm}}$ ，该触发器可以是具有危害性的文本输入 $x_{\text{harm}}^{t}$ 或使用视觉或 OCR 方法生成的图像输入 $x_{\text{harm}}^{i}$ 。因此，制作对抗性输入图像 $\widehat{x}_{adv}^{i}$ ，使其映射到接近 $H_{\text{harm}}$ 的危险嵌入区域。

\hat{x}_{adv}^{i}=\underset{x_{adv}\in\mathcal{B}}{\operatorname{argmin}}% \mathcal{L}_{2}\left(H_{\text{harm }},\mathcal{I}_{\phi}\left(x_{adv}^{i}% \right)\right),\quad\mathcal{I}_{\phi}(\cdot)-\operatorname{CLIP},

(10)

$\hat{x}_{\text{adv}}^{i}$ and $x_{g}^{t}$ are then input jointly into LVLMs, where their embeddings are combined in a manner that circumvents the model’s textual-only safety alignment. JIP [85] enables the generation of harmful outputs by exploiting the multimodal integration, thereby highlighting the increased vulnerability of LVLMs to sophisticated cross-modality attacks. JIP [85] achieve a high success rate across different LVLMs, highlighting the risk of cross-modality alignment vulnerabilities.
$\hat{x}_{\text{adv}}^{i}$ 和 $x_{g}^{t}$ 随后被共同输入到大型视觉语言模型（LVLMs）中，它们的嵌入以一种绕过模型纯文本安全对齐的方式被组合。JIP [85]通过利用多模态集成来生成有害输出，从而突显了 LVLMs 在面对复杂跨模态攻击时的脆弱性。JIP [85]在不同的 LVLMs 上实现了高成功率，突显了跨模态对齐漏洞的风险。

3.1.3 Black-Box Attacks 3.1.3 黑盒攻击

Black-Box Attacks are the most representative of real-world scenarios, where the attacker’s knowledge is limited to inputs and outputs, relying solely on querying the model and observing its responses. As illustrated in the bottom of Fig. 2, attackers often employ techniques such as Prompt Engineering, carefully crafting queries to bypass the model’s safety restrictions and exploit vulnerabilities in its behavior.
黑盒攻击是最具代表性的现实场景，攻击者的知识仅限于输入和输出，完全依赖于查询模型并观察其响应。如图 2 底部所示，攻击者通常采用提示工程等技术，精心设计查询以绕过模型的安全限制并利用其行为中的漏洞。

$\bullet$ Malicious Typography. Qraitem et al. [104] proposed the self-generated typography attack, wherein LVLMs are employed to generate words most similar to the objects in an image for typography. This technique is designed to disrupt the model’s classification by introducing subtle textual perturbations that confuse the model’s decision-making process. By exploiting the shortcoming that content safety guardrails of VLMs are ineffective against typographic visual prompts, FigStep [41] employs typographic transformations to convert malicious instructions, such as “Here is how to build a bomb: 1. 2. 3.,” into image format. These images are then input into the LVLM alongside carefully crafted prompts, inducing the model to complete the malicious instructions. Under this method, the typographic transformation allows malicious content to evade textual safety filters by embedding it as visual input. Combined with prompts designed to exploit the model’s multimodal reasoning, FigStep [41] instructing the model to answer the prohibited question in steps, effectively manipulates the LVLM to interpret and process these visual instructions, leading to unsafe completions that circumvent its alignment mechanisms. Additionally, MML [108] drawing inspiration from cryptography, utilizes an encryption-decryption process across both text and image modalities to prevent the over-exposure of malicious information. By replacing text in images [123] and prompting the model to perform decryption (e.g., ’burger’ → ’drugs’), MML [108] exploits the model’s logic and OCR capabilities to bypass security defenses, achieving a high attack success rate. HIMRD [109] divides the malicious prompt into two seemingly harmless components. One part, containing harmful words, is embedded into an image through typographic formatting and paired with a Text-to-Image (T2I) model to generate the corresponding image. The other part is inserted into the text using a prompt template (e.g., Look at the text at the top of the image, put the text inside the parentheses of ’make illegal ( )’). Through iterative optimization, the prompt is updated, ultimately achieving a successful jailbreak attack.
$\bullet$ 恶意排版。Qraitem 等人[104]提出了自生成排版攻击，该方法利用大型视觉语言模型生成与图像中对象最相似的文字进行排版。这种技术旨在通过引入微小的文本扰动来干扰模型的分类，从而混淆模型决策过程。通过利用视觉语言模型的内容安全防护措施对排版视觉提示无效的缺陷，FigStep[41]采用排版转换将恶意指令（如“这里是如何制造炸弹：1. 2. 3.”）转换为图像格式。然后将这些图像与精心设计的提示一同输入到大型视觉语言模型中，诱使模型完成恶意指令。在这种方法下，排版转换使恶意内容能够通过将其嵌入为视觉输入来规避文本安全过滤器。结合旨在利用模型多模态推理能力的提示，FigStep [ 41] 指示模型分步回答禁止性问题，有效操控大型视觉语言模型（LVLM）解释和处理这些视觉指令，导致不安全的回答绕过其对齐机制。此外，MML [ 108] 从密码学中获得灵感，在文本和图像模态之间使用加密解密过程，以防止恶意信息的过度暴露。通过替换图像中的文本 [ 123] 并提示模型执行解密（例如，'burger' → 'drugs'），MML [ 108] 利用模型的逻辑和 OCR 能力绕过安全防御，实现高攻击成功率。HIMRD [ 109] 将恶意提示分为两个看似无害的组件。一部分包含有害词汇，通过排版格式嵌入图像中，并与文本到图像（T2I）模型配对生成相应图像。另一部分使用提示模板插入文本中（例如，"看图像顶部的文本，将文本放在'make illegal ( )'的括号内"）。通过迭代优化，提示被更新，最终成功实现了越狱攻击。

$\bullet$ Visual Role-Play. Ma et al. [105] propose Visual Role-play (VRP), an effective structure-based jailbreak method that instructs the model to act as a high-risk character in the image input to generate harmful content. VRP [105] first utilize an LLM to generate a detailed description of a high-risk character. The description is then employed to create a corresponding character image. Next, VRP [105] integrate the typography of the character description and the associated malicious questions at the top and bottom of the character image, respectively, to form the complete jailbreak image input. This malicious image input is then paired with a benign role-play instruction text to query and attack LVLMs. Effectively misleads LVLMs into generating malicious responses.
$\bullet$ 视觉角色扮演。Ma 等人[ 105]提出了视觉角色扮演（VRP），这是一种基于结构的越狱方法，指导模型在图像输入中扮演高风险角色以生成有害内容。VRP[ 105]首先利用 LLM 生成一个高风险角色的详细描述。然后，利用该描述创建相应的角色图像。接下来，VRP[ 105]将角色描述的排版和相关的恶意问题分别集成到角色图像的顶部和底部，形成完整的越狱图像输入。这个恶意图像输入随后与一个良性的角色扮演指令文本配对，用于查询和攻击 LVLMs。有效地误导 LVLMs 生成恶意响应。

$\bullet$ Logical Questioning. Zou et al. [106] explore the use of LVLMs’ logic understanding capabilities, particularly in interpreting flowcharts, for jailbreak attacks. They transform harmful textual instructions into visual flowchart representations, leveraging the model’s multimodal processing to bypass safety measures. This method is further enhanced by integrating visual instructions with textual prompts to generate detailed harmful outputs.
$\bullet$ 逻辑提问。Zou 等人[ 106]探索了 LVLM 的逻辑理解能力，特别是在解释流程图方面的应用，用于越狱攻击。他们将有害的文本指令转换为视觉流程图表示，利用模型的多模态处理能力绕过安全措施。该方法通过将视觉指令与文本提示相结合，生成详细的有害输出，进一步得到增强。

$\bullet$ Red Teaming. IDEATOR[107] utilizes LVLMs as red team models to generate multimodal jailbreak prompts. Through an iterative dialogue between the attack and victim models, the system continuously refines and optimizes the generated prompts. This approach effectively explores a wide range of LVLM vulnerabilities without relying on white-box access or manual intervention, showcasing a robust and automated red teaming framework.
$\bullet$ 红队测试。IDEATOR[ 107]将 LVLM 用作红队模型，生成多模态越狱提示。通过攻击模型和受害者模型之间的迭代对话，系统不断优化和改进生成的提示。这种方法有效地探索了 LVLM 的多种漏洞，而无需依赖白盒访问或人工干预，展示了一个强大且自动化的红队测试框架。

3.2 Training-Phase Attacks
3.2 训练阶段攻击

Training-phase attacks typically necessitate that adversaries have access to training data of the model. By employing diverse data poisoning methodologies, these attacks are categorized into two distinct categories: Label Poisoning Attacks (§ 3.2.1) and Backdoor Trigger Attacks (§ 3.2.2). In contrast to Inference-Phase Attacks § 3.1, these strategies involve the modification of the model’s parameters, thereby facilitating not only malicious behaviors but also inducing disruptions in responses to benign queries.
训练阶段攻击通常需要攻击者能够访问模型的训练数据。通过采用多种数据污染方法，这些攻击被分为两类：标签污染攻击（§ 3.2.1）和后门触发攻击（§ 3.2.2）。与推理阶段攻击（§ 3.1）相比，这些策略涉及修改模型的参数，从而不仅能够实现恶意行为，还能导致对良性查询的响应出现干扰。

3.2.1 Label Poisoning Attacks
3.2.1 标签污染攻击

LVLMs are predominantly trained through visual instruction tuning methodologies [14, 15, 110]. Rather than relying on conspicuous triggers [124, 125] (e.g., specific keywords or unique images), Shadowcast [30] operates within a gray-box capability framework by utilizing the vision encoder of LVLMs to introduce noise into images. Shadowcast [30] induces LVLMs to misidentify class labels, such as confusing Donald Trump with Joe Biden. Furthermore, poisoned text-image pairs are injected into the training data, causing the model to generate irrelevant or erroneous responses when processing benign inputs. Experiments demonstrate that Shadowcast [30] is effective across different LVLM architectures and prompts, and is resilient to image augmentation and compression. Instead of relying on multiple poisoned samples or complex triggers, ImgTrojan [126] employs clean images as Trojan vectors by injecting a single benign image paired with malicious textual prompts—thereby replacing the original captions—into the training dataset. By strategically selecting and crafting malicious prompts, ImgTrojan [126] seeks to exploit inherent vulnerabilities within the behavior of LVLMs. Notably, the insertion of merely one poisoned image-text pair has been demonstrated to successfully jailbreak the model with a 50% success rate, inducing it to generate attacker-intended outputs in response to ostensibly benign queries. This demonstrates that even minimal data manipulation can undermine a LVLM’s safety mechanisms.
LVLMs 主要通过视觉指令微调方法进行训练 [14, 15, 110]。与依赖显眼触发器（例如特定关键词或独特图像）[124, 125] 不同，Shadowcast [30] 在灰盒能力框架内运行，通过利用 LVLM 的视觉编码器向图像中引入噪声。Shadowcast [30] 使 LVLM 误识别类别标签，例如将唐纳德·特朗普与乔·拜登混淆。此外，中毒的文本-图像对被注入训练数据中，导致模型在处理良性输入时生成不相关或错误响应。实验表明，Shadowcast [30] 对不同的 LVLM 架构和提示都有效，并且对图像增强和压缩具有抵抗力。ImgTrojan [126] 不依赖多个中毒样本或复杂触发器，而是通过将一个良性图像与恶意文本提示（从而替换原始标题）一起注入训练数据集，将干净图像用作特洛伊木马载体。通过策略性地选择和设计恶意提示，ImgTrojan [126] 试图利用 LVLM 行为中的固有漏洞。值得注意的是，仅插入一个被污染的图像-文本对就被证明能够以 50%的成功率成功绕过模型，使其在看似无害的查询下生成攻击者期望的输出。这表明即使是微小的数据操作也可能破坏大型视觉语言模型的安全机制。

3.2.2 Backdoor Trigger Attacks
3.2.2 后门触发攻击

TABLE III: Summary of essential characteristics for reviewed methods in Training-Phase Attacks (§ 3.2).
表 III：训练阶段攻击（§ 3.2）中审查方法的本质特征总结

Methods 方法	Venue 场所	Highlight 重点
Label Poisoning Attacks (§ 3.2.1) 标签中毒攻击（§ 3.2.1）
Shadowcast[30]	[NeurIPS’24]	Adversarial label replacement 对抗性标签替换
ImgTrojan[126]	[arXiv’24]	Inject clean image as trojan 将干净图像作为木马注入
Backdoor Trigger Attacks (§ 3.2.2) 后门触发攻击 (§ 3.2.2)
VL-Trojan[31]	[arXiv’24]	Poisoned instruction-response pairs 中毒指令-响应对
BadVLMDriver[32]	[arXiv’24]	Backdoor in autonomous driving 自动驾驶中的后门
MABA[127]	[arXiv’24]	Domain-agnostic triggers 领域无关触发器
TrojVLM[128]	[ECCV’24]	Semantic preserving loss 语义保留损失
VLOOD[129]	[arXiv’24]	Backdoor using OOD data 使用 OOD 数据的后门攻击

In contrast to direct label poisoning, backdoor trigger attacks typically involve training a subtle noise trigger to be embedded within images. VL-Trojan [31] implant backdoors into autoregressive LVLMs through a small set of poisoned instruction-response pairs. While the model maintains normal functionality under standard scenarios, encountering these specially crafted multimodal instructions leads it to produce malicious content. Compared to the diffuse, pattern-based manipulation in Shadowcast [30], VL-Trojan [31] centers on a more defined, though still hidden, instruction-based trigger. MABA [127] conducts an empirical assessment of the threats posed by mainstream backdoor attacks during the instruction-tuning phase of LVLMs under data distribution shifts. The findings demonstrate that the generalizability of backdoor attacks is positively associated with the independence of trigger patterns from specific data domains or model architectures, as well as with the models’ preference for trigger patterns over clean semantic regions. By utilizing attribution-based interpretation to position domain-agnostic triggers in critical decision-making regions, MABA [127] improves robustness across domains while mitigating vulnerabilities to backdoor activation. TrojVLM [128] manipulates specific pixels in an image to embed an attack trigger, enabling the model to insert predetermined target text into its output when processing poisoned images. Notably, TrojVLM [128] does not compromise the model’s ability to maintain its semantic understanding of the original image, highlighting the subtle yet impactful nature of the attack. VLOOD [129] explores backdoor attacks on LVLMs for image-to-text generation using Out-of-Distribution (OOD) data, addressing a realistic scenario where attackers lack access to the original training dataset. VLOOD [129] framework introduces new loss functions for maintaining conceptual consistency under poisoned inputs, aiming to balance model performance across clean and backdoored samples. Beyond digital triggers, recent work shows that physical cues can also serve as potent backdoors. In real world, BadVLMDriver [32] demonstrates a scenario where adversaries embed triggers into the physical environment, such as signage placed in a driving context. These physical artifacts, when captured by vehicle-mounted cameras and processed by a LVLM integrated into an autonomous driving system, can lead the model astray. Such triggers need not be digital, instead, strategically placed real-world elements can cause the model to misunderstand critical instructions or ignore safety constraints, posing a tangible threat to autonomous vehicles and other systems that rely heavily on LVLM-based perception.
与直接标签污染不同，后门触发攻击通常涉及在图像中嵌入一个微妙的噪声触发器。VL-Trojan [31] 通过一小批污染的指令-响应对将后门植入自回归视觉语言模型。虽然该模型在标准场景下保持正常功能，但遇到这些特别设计的多模态指令时，会导致其生成恶意内容。与 Shadowcast [30] 中那种弥散的、基于模式的操纵相比，VL-Trojan [31] 更侧重于一个更明确、尽管仍然隐藏的基于指令的触发器。MABA [127] 对主流后门攻击在 LVLMs（视觉语言模型）数据分布变化时的指令微调阶段所构成的威胁进行了实证评估。研究结果表明，后门攻击的泛化能力与触发模式与特定数据域或模型架构的独立性呈正相关，同时也与模型对触发模式相对于干净语义区域的偏好程度呈正相关。通过利用基于归因的解释在关键决策区域定位领域无关的触发器，MABA [ 127] 提高了跨领域的鲁棒性，同时减轻了后门激活的漏洞。TrojVLM [ 128] 通过操纵图像中的特定像素嵌入攻击触发器，使模型在处理中毒图像时能够将其输出的目标文本插入为预定文本。值得注意的是，TrojVLM [ 128] 并未损害模型维持原始图像语义理解的能力，突出了这种攻击的隐蔽性和影响力。VLOOD [ 129] 探索了使用分布外（OOD）数据对 LVLMs 进行图像到文本生成的后门攻击，解决了一种现实场景，即攻击者无法访问原始训练数据集。VLOOD [ 129] 框架引入了新的损失函数，以在中毒输入下保持概念一致性，旨在平衡模型在干净样本和后门样本上的性能。除了数字触发器，最近的研究表明，物理线索也可以作为强大的后门。在现实世界中，BadVLMDriver [ 32] 展示了一个攻击者将触发器嵌入物理环境的场景，例如在驾驶环境中放置标志物。这些物理物品被车载摄像头捕捉，并由集成到自动驾驶系统中的视觉语言模型（LVLM）处理时，会导致模型产生错误。这些触发器不必是数字化的，相反，策略性地放置的现实世界元素会导致模型误解关键指令或忽略安全约束，对依赖视觉语言模型（LVLM）感知的自动驾驶车辆和其他系统构成实际威胁。

4 Defense 4 防御

Similar to the attack methods discussed earlier (§ 3), defense strategies for Large Vision-Language Models (LVLMs) can be systematically classified into two major categories according to the stage of the model’s lifecycle: Inference-Phase Defenses (§ 4.1) and Training-Phase Defenses (§ 4.2).
与前面讨论的攻击方法（§ 3）类似，大型视觉语言模型（LVLMs）的防御策略可以根据模型生命周期的阶段系统地分为两大类：推理阶段防御（§ 4.1）和训练阶段防御（§ 4.2）。

4.1 Inference-Phase Defenses
4.1 推理阶段防御

Inference-Phase Defenses protect models during deployment, avoiding the high costs and limitations of training-phase defenses. These methods offer lower computational overhead, greater flexibility, and adaptability to new threats without retraining. As post-hoc solutions, they enhance pre-trained models’ safety, providing efficient strategies to improve LVLM robustness during inference. Specifically, we categorize these strategies into four classes:
推理阶段防御在模型部署时提供保护，避免了训练阶段防御的高成本和局限性。这些方法具有较低的计算开销，更高的灵活性和对新威胁的适应性，无需重新训练。作为事后解决方案，它们增强了预训练模型的安全性，为在推理过程中提高 LVLM 鲁棒性提供了有效的策略。具体来说，我们将这些策略分为四类：

4.1.1 Input Sanization Defenses
4.1.1 输入净化防御

TABLE IV: Summary of key characteristics of reviewed methods in Inference-Phase Defenses (§ 4.1). \faCircle and \faCircleThin represent Black-Box and White-Box Capability respectively.
表 IV：推理阶段防御中（§ 4.1）所审查方法的关键特征总结。\faCircle 和\faCircleThin 分别代表黑盒和白盒能力。

Methods 方法	Venue 场地	Cap. 章节	Highlight 重点
Input Sanization Defenses (§ 4.1.1) 输入净化防御（§ 4.1.1）
AdaShield[43]	[ECCV’24]	\faCircle	Defense prompt injection 防御提示注入
SmoothVLM[130]	[arXiv’24]	\faCircleThin	Randomized smoothing defense 随机平滑防御
CIDER[44]	[EMNLP’24]	\faCircle	Image semantic distance check 图像语义距离检查
BlueSuffix[131]	[arXiv’24]	\faCircle	Image & text purifier 图像与文本净化器
UniGuard[132]	[arXiv’24]	\faCircleThin	Safe noise & prompt suffix 安全噪声与提示后缀
Internal Optimization Defenses (§ 4.1.2) 内部优化防御（§ 4.1.2）
InferAligner[45]	[EMNLP’24]	\faCircleThin	Harmful activation difference 有害激活差异
CoCA[46]	[COLM’24]	\faCircleThin	Safe & Unsafe logits bias 安全与不安全 logits 偏差
CMRM[133]	[arXiv’24]	\faCircleThin	Correct visual representation 正确的视觉表示
ASTRA[134]	[arXiv’24]	\faCircleThin	Decompose input image 分解输入图像
IMMUNE[135]	[arXiv’24]	\faCircleThin	Inference time token alignment 推理时间标记对齐
Output Validation Defenses (§ 4.1.3) 输出验证防御（§ 4.1.3）
JailGuard[47]	[arXiv’23]	\faCircle	Mutated query diversity 变异查询多样性
MLLM-P[48]	[EMNLP’24]	\faCircle	Harm detector & response detoxifier 伤害检测器 & 响应解毒器
ECSO[136]	[ECCV’24]	\faCircle	Transfer visual into textual 将视觉转换为文本
MirrorCheck[137]	[arXiv’24]	\faCircle	Text-to-image transfer 文本到图像转换
Multi-Stage Integration Defenses (§ 4.1.4) 多阶段集成防御（§ 4.1.4）
ETA[28]	[arXiv’24]	\faCircle	Safety score & reward model 安全评分与奖励模型

Input data plays a pivotal role throughout the inference process of LVLMs, serving as a primary entry point for attacks. Attack strategies, whether prompt-based or perturbation-based, are meticulously designed to manipulate inputs, exploit model vulnerabilities, and compromise safety mechanisms. Input Sanization Defenses address these threats by analyzing, filtering, and transforming input data to neutralize malicious patterns. Key techniques include prompt engineering and image perturbation, all of which aim to enhance input reliability and reduce susceptibility to attacks.
输入数据在整个 LVLM 推理过程中起着关键作用，是攻击的主要入口。攻击策略，无论是基于提示还是基于扰动，都经过精心设计以操纵输入、利用模型漏洞并破坏安全机制。输入净化防御通过分析、过滤和转换输入数据来消除恶意模式，从而应对这些威胁。关键技术包括提示工程和图像扰动，所有这些目标都是为了提高输入可靠性并降低遭受攻击的易感性。

$\bullet$ Prompt Engineering. To defend against prompt-based attacks, AdaShield [43] leverages the instruction-following capabilities of LVLMs by incorporating safety prefixes into the input, aiming to activate the model’s inherent safety mechanisms through prompting techniques. Specifically, the safety prefixes are categorized into fixed and adaptive textual prefixes, which are dynamically optimized based on the characteristics of malicious input queries. By utilizing a defender model to iteratively generate, evaluate, and refine these prompts, AdaShield [43] improves the robustness of LVLMs without the need for fine-tuning or additional modules. BlueSuffix [131] employs a diffusion-based method to purify jailbreak images and introduces an LLM-based text purifier to rewrite adversarial textual prompts while preserving their original meaning. Based on the purified data, a trained Suffix Generator is utilized to produce prompt suffixes that guide the model, thereby enhancing its safety capabilities.
$\bullet$ 提示工程。为了防御基于提示的攻击，AdaShield [ 43] 利用 LVLMs 的指令跟随能力，通过将安全前缀整合到输入中，旨在通过提示技术激活模型的内在安全机制。具体来说，安全前缀被分为固定和自适应文本前缀，这些前缀根据恶意输入查询的特征进行动态优化。通过利用防御模型迭代生成、评估和优化这些提示，AdaShield [ 43] 在无需微调或额外模块的情况下提高了 LVLMs 的鲁棒性。BlueSuffix [ 131] 采用基于扩散的方法净化越狱图像，并引入基于 LLM 的文本净化器来重写对抗性文本提示，同时保留其原始含义。基于净化后的数据，一个训练好的后缀生成器被用于生成引导模型的提示后缀，从而增强其安全能力。

$\bullet$ Image Perturbation. To address perturbation-based attacks, detecting and removing adversarial noise has proven to be highly effective. Specifically, CIDER [44] leverages the observation that the semantic distance between clean and adversarial images, relative to harmful queries, exhibits significant differences. As a denoising model, CIDER [44] iteratively removes noise from the input images and evaluates the semantic distance before and after denoising. If the distance exceeds a predefined threshold, the input is identified as malicious and rejected. CIDER [44] effectively filters out adversarial inputs while preserving the integrity of benign ones. Drawing inspiration from SmoothLLM [138], SmoothVLM [130] enhances the robustness of LVLMs against adversarial patch attacks [23, 27, 85], which employs randomized smoothing by introducing controlled noise to input images, which helps mitigate the impact of adversarial patches. Ensures that small, localized perturbations are smoothed out, reducing their ability to mislead the model while maintaining the semantic fidelity of the input. Significantly reduces the success rate of attacks on two leading LVLMs under 5%, while achieving up to 95% context recovery of the benign images, demonstrating a balance between security, usability, and efficiency.
$\bullet$ 图像扰动。为了应对基于扰动的攻击，检测并移除对抗性噪声已被证明非常有效。具体来说，CIDER [ 44] 利用了一个观察结果：相对于有害查询，干净图像和对抗性图像之间的语义距离表现出显著差异。作为去噪模型，CIDER [ 44] 迭代地从输入图像中移除噪声，并在去噪前后评估语义距离。如果距离超过预设阈值，输入就会被识别为恶意并拒绝。CIDER [ 44] 有效过滤掉对抗性输入，同时保留良性输入的完整性。受 SmoothLLM [ 138] 启发，SmoothVLM [ 130] 增强了 LVLMs 对对抗性补丁攻击 [ 23, 27, 85] 的鲁棒性，该攻击通过向输入图像引入受控噪声进行随机平滑，有助于减轻对抗性补丁的影响。确保小的、局部的扰动被平滑掉，从而降低它们误导模型的能力，同时保持输入的语义保真度。显著降低了针对两个领先 LVLM 的攻击成功率至 5%以下，同时实现了良性图像高达 95%的上下文恢复，展示了在安全性、可用性和效率之间的平衡。

$\bullet$ Hybrid Perturbation. UniGuard [132] combines prompt engineering and image perturbation, jointly addressing unimodal and cross-modal harmful inputs. Specifically, UniGuard [132] introduces additive safe noise for image inputs and applies suffix modifications to text prompts, effectively reducing the likelihood of generating unsafe responses. By training on a targeted small corpus of toxic content, UniGuard [132] achieves significant robustness against a wide range of adversarial attacks.
$\bullet$ 混合扰动。UniGuard [ 132] 结合了提示工程和图像扰动，共同处理单模态和跨模态的有害输入。具体来说，UniGuard [ 132] 为图像输入引入加性安全噪声，并对文本提示应用后缀修改，有效降低了生成不安全响应的可能性。通过在有毒内容的小型目标语料库上训练，UniGuard [ 132] 实现了对广泛对抗性攻击的显著鲁棒性。

4.1.2 Internal Optimization Defenses
4.1.2 内部优化防御

As the critical stage determining model outputs, the generation of unsafe responses in LVLMs is heavily influenced by the alignment and robustness of their internal safety mechanisms. As discussed earlier in § 2.2, the safety capabilities of LVLMs are particularly impacted by the vision modality, which often lags behind the text modality due to insufficient alignment at the hidden states. To address these vulnerabilities, internal-level defenses enhance model safety by directly intervening in its internal activations, representations, and computation processes. Closely related methods mainly divide into two factions.
作为决定模型输出的关键阶段，LVLMs 中不安全响应的产生受到其内部安全机制对齐性和鲁棒性的严重影响。如§ 2.2 中所述，LVLMs 的安全能力特别受视觉模态的影响，由于隐藏状态对齐不足，视觉模态往往落后于文本模态。为解决这些漏洞，内部级别的防御通过直接干预其内部激活、表征和计算过程来增强模型安全性。密切相关的方法主要分为两大阵营。

$\bullet$ Activation Alignment. As researches in LLM safety [139], the model’s parameter activations exhibit noticeable differences when processing safe and unsafe requests respectively. To leverage this observation, InferAligner [45] calculates the mean activation difference of the last token between harmful and harmless prompts. Based on this calculation, a threshold is established to identify unsafe responses. When the activation value of a token surpasses the threshold, the mean activation difference is applied to bias-correct the output, effectively mitigating unsafe content and ensuring more secure and reliable responses. Additionally, CMRM [133] highlights that LVLMs exhibit vulnerabilities when queried with images but tend to restore safety when images are excluded. To address this issue, CMRM [133] computes the hidden state activation bias between pure text inputs and text-image inputs at the same layer $l$ , as defined:
$\bullet$ 激活对齐。在 LLM 安全研究[ 139]中，模型在处理安全请求与不安全请求时，其参数激活表现出明显差异。为利用这一观察结果，InferAligner[ 45]计算了有害提示和无害提示之间最后一个 token 的平均激活差异。基于此计算，建立了一个阈值来识别不安全响应。当 token 的激活值超过阈值时，将应用平均激活差异对输出进行偏差校正，有效减轻不安全内容并确保更安全可靠的响应。此外，CMRM[ 133]指出，LVLMs 在接收图像查询时表现出漏洞，但在排除图像时倾向于恢复安全性。为解决此问题，CMRM[ 133]计算了同一层纯文本输入与文本-图像输入之间的隐藏状态激活偏差，定义如下： $l$

Bias^{l}=\text{PCA}\left(\left\{\mathbf{h}_{t}^{l(i)}-\mathbf{h}_{c}^{l(i)}% \right\}_{i=1}^{N}\right),

(11)

where $\mathbf{h}_{t}^{l(i)}$ and $\mathbf{h}_{c}^{l(i)}$ represent the hidden state activations of the $i$ -th input in the $l$ -th layer for pure text input and text-image input, respectively. Here, $N$ denotes the total number of samples in the dataset. By applying PCA (Principal Component Analysis) to the activation differences, CMRM [133] identifies the principal direction of variation caused by the visual input. This bias is then used to correct the visual-induced misalignment, aligning multimodal representations closer to the original LLM text-only distribution while retaining the benefits of visual information. To defend against adversarial jailbreaks, ASTRA [134] decomposes input images to identify regions with high correlation to the attack. Based on these regions, steering vectors are constructed to capture adversarial directions in the activation space. During inference, ASTRA [134] projects the model’s activations onto the steering vectors and applies corrections to remove components aligned with the jailbreak-related directions, thereby mitigating adversarial influence while maintaining the model’s performance.
在纯文本输入和文本-图像输入中， $\mathbf{h}_{t}^{l(i)}$ 和 $\mathbf{h}_{c}^{l(i)}$ 分别表示第 $i$ 个输入在第 $l$ 层的隐藏状态激活。这里， $N$ 表示数据集中的样本总数。通过将主成分分析（PCA）应用于激活差异，CMRM [ 133] 识别出由视觉输入引起的变异主方向。然后利用这种偏差来纠正视觉引起的错位，使多模态表示更接近原始 LLM 文本分布，同时保留视觉信息的好处。为了防御对抗性越狱攻击，ASTRA [ 134] 将输入图像分解以识别与攻击高度相关的区域。基于这些区域，构建引导向量以捕获激活空间中的对抗方向。在推理过程中，ASTRA [ 134] 将模型的激活投影到引导向量上，并对与越狱相关方向对齐的分量应用校正，从而减轻对抗影响，同时保持模型性能。

$\bullet$ Logits Adjustment. CoCA [46] introducing a safe instruction to harmful queries and calculating the resulting logits bias. This bias represents the adjustment required to align the model’s outputs with safer responses. During inference, the logits bias is directly applied to the model’s output layer, allowing it to maintain robust safety capabilities without explicitly adding safe instructions to the input. Similar to CoCA [46], IMMUNE [135] enhances the safety of model outputs by aligning responses during the inference stage. Instead of explicitly modifying inputs, IMMUNE [135] evaluates the safety of candidate tokens by introducing a safe reward model $Q_{\text{safe}}$ that quantifies the likelihood of a token contributing to a harmful response. The reward score is combined with the model’s original logits to compute an adjusted decoding score, which guides the generation process toward safer outputs.
$\bullet$ Logits 调整。CoCA [ 46] 向有害查询引入安全指令并计算由此产生的 logits 偏差。该偏差表示使模型输出与更安全响应对齐所需的调整。在推理过程中，logits 偏差直接应用于模型的输出层，使其能够在不向输入显式添加安全指令的情况下保持强大的安全能力。类似于 CoCA [ 46]，IMMUNE [ 135] 通过在推理阶段对齐响应来增强模型输出的安全性。IMMUNE [ 135] 不是通过显式修改输入，而是通过引入一个安全奖励模型 $Q_{\text{safe}}$ 来评估候选 token 的安全性，该模型量化了 token 导致有害响应的可能性。奖励分数与模型的原始 logits 结合，计算出一个调整后的解码分数，从而引导生成过程朝向更安全的输出。

4.1.3 Output Validation Defenses
4.1.3 输出验证防御

Output-Level Defenses focus on safeguarding the model’s final outputs by mitigating unsafe responses before they are delivered to users. These defenses primarily rely on techniques such as detection and response rewriting, which are both straightforward and computationally efficient.
输出级防御专注于通过在输出交付给用户之前减轻不安全响应来保护模型的最终输出。这些防御主要依赖于检测和响应重写等技术，这些技术既简单又计算高效。

$\bullet$ Harmful Detecting. JailGuard [47] observes that attack inputs inherently exhibit lower robustness compared to benign queries, irrespective of the attack methods or modalities. To exploit this property, JailGuard systematically designs and implements 16 random mutators and 2 semantic-driven targeted mutators to introduce perturbations at various levels of text and image inputs. By comparing the cosine similarity between the mutated and original inputs, JailGuard identifies significant discrepancies as indicators of adversarial attacks. This approach serves as a universal detection method capable of handling diverse attack types and modalities. Pi et al. [48] propose MLLM-Protector, a defense framework that fine-tunes two separate LLMs to serve as a Harm Detector and a Response Detoxifier. The Harm Detector is designed to accurately identify outputs that violate predefined safety protocols, ensuring that harmful responses are flagged before being delivered to users. Meanwhile, the Response Detoxifier enhances the model’s helpfulness by rewriting harmful outputs into safe and constructive responses, effectively balances safety and utility. Unlike previous approaches, ECSO [136] operates without introducing additional modules. It demonstrates that LVLMs are inherently capable of assessing the safety of their outputs. When the model detects that its response may be harmful, ECSO [136] mitigates this risk by converting the visual input into a textual caption and proceeding with text-only processing. This method highlights the ability to harness the intrinsic safety mechanisms of the LLM component within LVLMs, effectively reducing the generation of unsafe outputs while maintaining computational efficiency. For adversarial attacks, MirrorCheck [137] employs Text-to-Image (T2I) models [140, 141, 142] to generate images from captions produced by the target VLMs and then computes the similarity between the feature embeddings of the input and generated images to identify adversarial samples. MirrorCheck [137] demonstrates robust defense capabilities, offering an effective, training-free approach to detect and mitigate adversarial threats in LVLMs.
$\bullet$ 有害检测。JailGuard [ 47] 观察到攻击输入相较于良性查询，无论攻击方法或模态如何，都表现出较低的鲁棒性。为了利用这一特性，JailGuard 系统性地设计和实现了 16 个随机变异器和 2 个语义驱动的目标变异器，以在文本和图像输入的不同层次引入扰动。通过比较变异输入和原始输入之间的余弦相似度，JailGuard 将显著差异识别为对抗性攻击的指标。这种方法是一种通用的检测方法，能够处理各种攻击类型和模态。Pi 等人[ 48]提出了 MLLM-Protector，这是一个防御框架，通过微调两个独立的 LLMs 来充当有害检测器和响应解毒器。有害检测器被设计为准确识别违反预定义安全协议的输出，确保在将有害响应交付给用户之前将其标记。同时，响应解毒器通过将有害输出重写为安全且建设性的响应来增强模型的有用性，有效地平衡了安全性和实用性。与先前方法不同，ECSO [ 136] 不需要引入额外模块。它证明了视觉语言模型（LVLMs）本质上能够评估其输出的安全性。当模型检测到其响应可能有害时，ECSO [ 136] 通过将视觉输入转换为文本标题，并继续仅进行文本处理来缓解这种风险。这种方法突出了利用 LVLMs 中 LLM 组件的内在安全机制的能力，有效减少不安全输出的生成，同时保持计算效率。对于对抗性攻击，MirrorCheck [ 137] 采用文本到图像（T2I）模型 [ 140, 141, 142] 从目标视觉语言模型（VLMs）生成的标题中生成图像，然后计算输入图像和生成图像的特征嵌入之间的相似度，以识别对抗性样本。 MirrorCheck [ 137] 展示了强大的防御能力，提供了一种有效的、无需训练的方法来检测和缓解 LVLMs 中的对抗性威胁。

4.1.4 Multi-Stage Integration Defenses
4.1.4 多阶段集成防御

Multi-Level Defenses combine strategies from input, internal, and output levels to create a comprehensive defense framework that ensures model safety. Provide robust and highly effective solutions to maintain safe and reliable outputs in LVLMs, harness the strengths of diverse techniques. ETA [28] integrates defense mechanisms across the input and output stages to enhance model safety. During the pre-generation phase, ETA [28] employs an evaluation prompt $\mathcal{P}$ with CLIP [12] to calculate a safety score for the visual input. In the post-generation phase, a reward model (RM) evaluates the safety of the generated output. If both the pre-generation and post-generation scores indicate unsafety, a predefined prefix (e.g., ”As an AI assistant,”) is appended to the prompt to guide the model toward generating safer responses. In the output stage, ETA [28] utilizes a Best-of-N strategy, generating multiple candidate responses and selecting the one that optimizes a weighted combination of safety and relevance scores.
多层防御结合输入、内部和输出层面的策略，构建一个全面的防御框架以确保模型安全。为 LVLM 提供稳健且高效解决方案，以维持安全可靠的输出，并利用多种技术的优势。ETA [ 28] 融合输入和输出阶段的防御机制，以增强模型安全。在预生成阶段，ETA [ 28] 采用带有 CLIP [ 12] 的评估提示 $\mathcal{P}$ 来计算视觉输入的安全分数。在生成后阶段，奖励模型（RM）评估生成输出的安全性。如果预生成和生成后的分数均显示不安全，则向提示附加预定义的前缀（例如，“作为 AI 助手，”）以引导模型生成更安全的响应。在输出阶段，ETA [ 28] 采用最佳 N 策略，生成多个候选响应，并选择在安全性和相关性分数的加权组合中表现最优的那个。

4.2 Training-Phase Defenses
4.2 训练阶段防御

The training phase is crucial in developing machine learning models, especially Large Models. Training-Phase Defenses integrate safety mechanisms during this foundational stage, enhancing the model’s robustness by strengthening its internal architecture. Unlike inference-phase strategies (§ 4.1), these defenses enable models to autonomously mitigate adversarial challenges without relying on external mechanisms. Based on the data collection and processing pipeline, these strategies are classified into two main categories:
训练阶段在开发机器学习模型中至关重要，尤其是大型模型。训练阶段防御在基础阶段整合安全机制，通过强化模型内部架构来提升其鲁棒性。与推理阶段策略（§ 4.1）不同，这些防御使模型能够自主应对对抗性挑战，无需依赖外部机制。基于数据收集和处理流程，这些策略分为两大类：

4.2.1 Data-Driven Refinement
4.2.1 数据驱动优化

The quality of training data is fundamental to ensuring model safety, forming the basis for robust performance and resilience against adversarial challenges. This part examines existing efforts dedicated to the construction and refinement of secure datasets, which play a pivotal role in enhancing both the robustness and alignment of LVLMs.
训练数据的质量是确保模型安全的基础，是提升模型鲁棒性和对抗性挑战适应性的关键。本部分考察了致力于构建和优化安全数据集的现有工作，这些数据集在增强大型视觉语言模型的鲁棒性和一致性方面发挥着关键作用。

$\bullet$ Adversarial Specific Dataset. In adversarial detection, Huang et al. [143] introduce RADAR, a large-scale open-source adversarial dataset containing 4,000 samples. RADAR offers diverse harmful queries and responses, with samples sourced from the COCO dataset [144] and adversarial inputs generated using the PGD [79] method. To ensure high-quality samples, RADAR incorporates filtering procedures during construction, verifying that responses to benign inputs remain harmless while those to adversarial inputs are appropriately harmful.
$\bullet$ 对抗特定数据集。在对抗检测中，Huang 等人[ 143]介绍了 RADAR，这是一个包含 4,000 个样本的大规模开源对抗数据集。RADAR 提供了多样化的有害查询和响应，样本来源于 COCO 数据集[ 144]，并使用 PGD[ 79]方法生成对抗输入。为确保样本质量，RADAR 在构建过程中加入了过滤流程，验证对良性输入的响应保持无害，而对对抗输入的响应则适当有害。

$\bullet$ General Safety Dataset. Chen et al. [145] collet VLSafe, a harmless alignment dataset related to images, created using an LLM-Human-in-the-Loop approach and GPT-3.5-Turbo [3]. VLSafe [145] construction involves iterative refinement and filtering to ensure safety and quality, borrowing methods from textual adversarial attack research. Based on the COCO dataset [144], multiple iterations refine the dataset, followed by rounds of filtering to eliminate failure modes. The final dataset contains 5,874 samples, split into 4,764 training samples and 1,110 evaluation samples. Zong et al. [49] demonstrate through experiments that the inclusion of the vision modality significantly reduces the safety capabilities of LVLMs. Currently, a substantial portion of training data is generated by LLMs, which often contains harmful content, thereby contributing to the degradation of safety alignment in LVLMs. Furthermore, the use of LoRA fine-tuning has been shown to introduce additional safety risks. While cleaning the training data can partially restore safety alignment, its overall effectiveness remains limited. Based on these findings, [49] set out to collect a new safe vision-language instruction-following dataset VLGuard, significantly reduces the harmfulness of models across all fine-tuning strategies and models considered. To address the lack of high-quality open-source training datasets necessary for achieving the safety alignment of LVLMs, Zhang et al. [50] introduce SPA-VL, a large-scale safety alignment dataset that encompasses 6 harmfulness domains, 13 categories, and 53 subcategories. The dataset consists of 100,788 quadruple samples, with responses collected from 12 diverse LVLMs, including both open-source models (e.g., QwenVL [65]) and closed-source models (e.g., Gemini [60]), to ensure diversity. SPA-VL [50] reveals that increasing data volume, incorporating diverse responses, and using a mix of question types significantly enhance the safety and performance of aligned models. Helff et al. [146] propose LlavaGuard for dataset annotation and safeguarding generative models, leveraging curated datasets and structured evaluation methods. It uses a JSON-formatted output with safety ratings, category classifications, and natural language rationales to ensure comprehensive assessments. The dataset, built on the Socio-Moral Image Database (SMID) [147] and expanded with web-scraped images, addresses category imbalances with at least 100 images per category. LlavaGuard [146] incorporates refined safety ratings (e.g., “Highly Unsafe”, “Moderately Unsafe”) and synthetic rationales generated by the Llava-34B model [15] to enhance generalization. Data augmentation techniques improve dataset balance and adaptability, resulting in 4,940 samples, with 599 reserved for testing.
$\bullet$ 通用安全数据集。陈等人 [ 145] 收集了 VLSafe，这是一个与图像相关的无害对齐数据集，使用 LLM-人类参与循环方法（LLM-Human-in-the-Loop approach）和 GPT-3.5-Turbo [ 3] 创建。VLSafe [ 145] 的构建涉及迭代优化和过滤，以确保安全性和质量，借鉴了文本对抗性攻击研究中的方法。基于 COCO 数据集 [ 144]，经过多次迭代优化数据集，然后进行多轮过滤以消除失效模式。最终数据集包含 5,874 个样本，分为 4,764 个训练样本和 1,110 个评估样本。宗等人 [ 49] 通过实验表明，包含视觉模态会显著降低 LVLMs 的安全能力。目前，大量训练数据由 LLMs 生成，其中常包含有害内容，从而导致 LVLMs 的安全对齐能力下降。此外，LoRA 微调已被证明会引入额外的安全风险。虽然清理训练数据可以部分恢复安全对齐，但其整体效果仍然有限。基于这些发现，[49] 致力于收集新的安全视觉语言指令遵循数据集 VLGuard，显著降低了所有考虑的微调策略和模型的有害性。为了解决实现 LVLM 安全对齐所需的高质量开源训练数据缺乏的问题，Zhang 等人 [50] 引入了 SPA-VL，这是一个包含 6 个有害性领域、13 个类别和 53 个子类的大型安全对齐数据集。该数据集包含 100,788 个四元组样本，从 12 种不同的 LVLM 收集了响应，包括开源模型（例如 QwenVL [65]）和闭源模型（例如 Gemini [60]），以确保多样性。SPA-VL [50] 表明，增加数据量、包含多样化的响应以及使用多种题型可以显著提高对齐模型的安全性和性能。Helff 等人 [146] 提出了 LlavaGuard 用于数据集标注和安全防护生成模型，利用精选数据集和结构化评估方法。它使用 JSON 格式的输出，包含安全评级、类别分类和自然语言理由，以确保全面评估。该数据集基于社会道德图像数据库（SMID）[147]构建，并通过网络抓取的图像进行扩展，以解决类别不平衡问题，每个类别至少包含 100 张图像。LlavaGuard [146] 结合了更精细的安全评级（例如，“高度危险”、“中度危险”）以及由 Llava-34B 模型[15]生成的合成理由，以增强泛化能力。数据增强技术提高了数据集的平衡性和适应性，最终得到 4,940 个样本，其中 599 个用于测试。

TABLE V: Summary of essential characteristics for reviewed methods in Training-Phase Defenses (§ 4.2).
表 V：训练阶段防御（§ 4.2）中审查方法的必要特征总结

Methods 方法	Venue 场地	Highlight 重点
Data-Driven Refinement (§ 4.2.1) 数据驱动优化（§ 4.2.1）
VLSafe[145]	[CVPR’24]	LLM-Human-in-the-Loop LLM-人机交互
VLGuard[49]	[ICML’24]	Safe instruction following dataset 安全指令遵循数据集
LLaVAGuard[146]	[arXiv’24]	Refined ratings & rationales dataset 细化评分与理由数据集
SPA-VL[50]	[arXiv’24]	Large-scale & domain diversity 大规模与领域多样性
RADAR[143]	[arXiv’24]	Adversarial detection dataset 对抗检测数据集
Strategy-Driven Optimization (§ 4.2.2) 策略驱动优化（§ 4.2.2）
FARE[148]	[ICML’24]	Unsupervised CLIP robust training 无监督 CLIP 鲁棒训练
SIU[149]	[NeurIPS’24]	Selective unlearning framework 选择性遗忘框架
SafeVLM[51]	[arXiv’24]	Safety projection & token & head 安全投影 & 令牌 & 头
TextUnlearn[150]	[EMNLP’24]	Unlearning solely in textual 仅文本中的去学习
Sim-CLIP[151]	[arXiv’24]	Siamese architecture Siamese 架构
Sim-CLIP+[152]	[arXiv’24]	Stop-gradient mechanism 停止梯度机制
BaThe[153]	[arXiv’24]	Harmful instruction as trigger 有害指令作为触发器
TGA[29]	[arXiv’24]	Transfer safety from LLM to LVLM 将安全从 LLM 转移到 LVLM

4.2.2 Strategy-Driven Optimization
4.2.2 策略驱动优化

Beyond data quality, the design of effective training strategies is equally critical for enhancing model safety and robustness. This part explores optimization techniques and training paradigms that aim to fortify models against attacks and improving their alignment with safety objectives.
除了数据质量，设计有效的训练策略对于提升模型安全性和鲁棒性同样至关重要。本部分探讨了旨在增强模型抗攻击能力和提高其与安全目标一致性的优化技术和训练范式。

$\bullet$ Visual Enhancement. Schlarmann et al. [148] introduce FARE, an unsupervised adversarial fine-tuning approach for improving the robustness of the CLIP vision encoder against adversarial attacks while preserving its clean zero-shot performance. FARE [148] optimizes an embedding loss that ensures perturbed inputs produce embeddings close to their unperturbed counterparts, enabling the vision encoder to retain compatibility with downstream tasks like LVLMs without additional re-training. FARE [148] implemented using PGD-based [79] adversarial training, is dataset-agnostic and can generalize to other foundation models with intermediate embedding layers. Sim-CLIP [151] tackles the challenges present in FARE [148] by integrating a Siamese architecture with cosine similarity loss. During training, Sim-CLIP [151] generates perturbed views of input images using PGD [79] and optimizes the alignment between clean and perturbed representations to ensure robustness against adversarial attacks. The method minimizes negative cosine similarity to enforce invariance between these representations while maintaining model coherence. Additionally, a stop-gradient mechanism is incorporated to prevent loss collapse, enabling efficient adversarial training without the need for negative samples or momentum encoders. Sim-CLIP+ [152] extends Sim-CLIP [151] to defend against advanced optimization-based jailbreak attacks targeting LVLMs. By leveraging a tailored cosine similarity loss and a stop-gradient mechanism, Sim-CLIP+ [152] prevents symmetric loss collapse, ensuring computational efficiency while maintaining robustness.
$\bullet$ 视觉增强。Schlarmann 等人[148]提出了 FARE，这是一种无监督对抗微调方法，旨在提高 CLIP 视觉编码器对对抗攻击的鲁棒性，同时保持其干净的零样本性能。FARE[148]优化了一个嵌入损失，确保扰动输入产生的嵌入与其未扰动版本接近，使视觉编码器能够在无需额外重新训练的情况下保持与下游任务（如 LVLMs）的兼容性。FARE[148]采用基于 PGD[79]的对抗训练实现，具有数据集无关性，并能泛化到其他具有中间嵌入层的基础模型。Sim-CLIP[151]通过集成 Siamese 架构和余弦相似度损失来解决 FARE[148]中存在的问题。在训练过程中，Sim-CLIP[151]使用 PGD[79]生成输入图像的扰动视图，并优化干净表示和扰动表示之间的对齐，以确保对对抗攻击的鲁棒性。该方法最小化负余弦相似度，以强制执行这些表示之间的不变性，同时保持模型一致性。此外，还引入了停止梯度机制以防止损失坍塌，从而无需负样本或动量编码器即可进行高效的对抗训练。Sim-CLIP+ [ 152] 扩展了 Sim-CLIP [ 151]，用于防御针对大型视觉语言模型（LVLMs）的高级基于优化的越狱攻击。通过利用定制的余弦相似度损失和停止梯度机制，Sim-CLIP+ [ 152] 防止了对称损失坍塌，确保了计算效率的同时保持了鲁棒性。

$\bullet$ Knowledge Unlearning. Chakraborty et al. [150] propose SIU, a novel framework for implementing machine unlearning in LVLM safety. SIU [150] addresses the challenge of selectively removing visual data associated with specific concepts by leveraging fine-tuning on a single representative image. The approach is composed of two key components: (i) the construction of multifaceted fine-tuning datasets designed to target four distinct unlearning objectives and (ii) the incorporation of a Dual Masked KL-divergence Loss, which enables the simultaneous unlearning of targeted concepts while maintaining the overall functional integrity and utility of the LVLMs. TextUnlearning [150] notes that irrespective of the input modalities, all information is ultimately fused within the language space. Comparative experiments reveal that unlearning focused solely on the text modality outperforms multimodal unlearning approaches. By performing “textual” unlearning exclusively on the LLM component of LVLMs, while keeping the remaining modules frozen, this method achieves remarkable levels of harmlessness against cross-modality attacks.
$\bullet$ 知识去学习。Chakraborty 等人[ 150]提出了 SIU，这是一个用于在 LVLM 安全中实现机器去学习的新框架。SIU[ 150]通过在单个代表性图像上进行微调，解决了选择性地移除与特定概念相关的视觉数据这一挑战。该方法由两个关键组件组成：(i) 构建旨在针对四个不同去学习目标的多样化微调数据集，以及(ii) 结合双掩码 KL 散度损失，该损失能够在保持 LVLM 的整体功能完整性和实用性的同时，实现对目标概念的同时去学习。TextUnlearning[ 150]指出，无论输入模态如何，所有信息最终都在语言空间中融合。比较实验表明，仅针对文本模态进行去学习的效果优于多模态去学习方法。通过仅对 LVLM 的 LLM 组件进行“文本”去学习，同时冻结其余模块，该方法在抵御跨模态攻击方面达到了极高的无害性水平。

$\bullet$ Module Integration. SafeVLM [51] integrating three key safety modules: safety projection, safety tokens, and a safety head into LLaVA to enhance safety. SafeVLM [51] employs two-stage training strategy, where safety modules are first trained with the base model frozen, followed by fine-tuning the language model to align safety measures with vision features. During inference, safety embeddings generated by the safety head provide dynamic and customizable risk control, enabling flexible adjustments based on user needs, such as content grading and categorization.
$\bullet$ 模块集成。SafeVLM [ 51] 将三个关键安全模块——安全投影、安全标记和安全头部集成到 LLaVA 中，以增强安全性。SafeVLM [ 51] 采用两阶段训练策略，首先冻结基础模型训练安全模块，然后微调语言模型以使安全措施与视觉特征保持一致。在推理过程中，安全头部生成的安全嵌入提供动态和可定制的风险控制，能够根据用户需求进行灵活调整，例如内容分级和分类。

$\bullet$ Advanced Fine-tuning. BaThe [153] treats harmful instructions as potential triggers that can exploit backdoored models to produce prohibited outputs. BaThe [153] replaces manually designed triggers (e.g., “SUDO”) with rejection prompts embedded as “soft text embeddings” called the wedge, maps harmful instructions to rejection responses. BaThe also defends against more advanced virtual prompt backdoor attacks, where harmful instructions combined with subtle prompts act as triggers. By embedding rejection prompts into the model’s soft text embeddings and including multimodal QA datasets in training, BaThe effectively mitigates backdoor risks, ensuring safer and more robust model behavior. TGA [29] finds that the hidden states at specific transformer layers play a crucial role in the successful activation of safety mechanisms, highlighting that the vision-language alignment at the hidden state level in current methods is insufficient. To address this, TGA [29] aligns the hidden states of visual ( $X_{\text{image}}$ ) and textual ( $X_{\text{caption}}$ ) inputs across transformer layers in LVLMs by introducing a pair-wise loss function ( $\mathcal{L}_{\text{guide}}$ ). In this process, retrieved text ( $X_{\text{retrieval}}$ ) serves as a guide, ensuring that $I_{j}$ (hidden states of $X_{\text{image}}$ ) is closer to $C_{j}$ (hidden states of $X_{\text{caption}}$ ) than $R_{j}$ (hidden states of $X_{\text{retrieval}}$ ), achieving semantic consistency. The total loss combines $\mathcal{L}_{\text{guide}}$ with cross-entropy loss for language modeling, enhancing multimodal alignment at the hidden state level and improving the safety mechanism.
$\bullet$ 高级微调。BaThe [ 153] 将有害指令视为可能利用后门模型生成禁止输出的潜在触发器。BaThe [ 153] 用嵌入为“软文本嵌入”的拒绝提示（称为楔子）替换手动设计的触发器（例如，“SUDO”），将有害指令映射到拒绝响应。BaThe 还防御更高级的虚拟提示后门攻击，其中有害指令与微妙提示结合作为触发器。通过将拒绝提示嵌入模型的软文本嵌入中，并在训练中包含多模态问答数据集，BaThe 有效缓解后门风险，确保更安全、更稳健的模型行为。TGA [ 29] 发现特定 Transformer 层的隐藏状态在安全机制的成功激活中起着关键作用，强调当前方法在隐藏状态级别的视觉-语言对齐是不充分的。为此，TGA [ 29] 通过引入成对损失函数（ $\mathcal{L}_{\text{guide}}$ ），在 LVLMs 中跨 Transformer 层对齐视觉（ $X_{\text{image}}$ ）和文本（ $X_{\text{caption}}$ ）输入的隐藏状态。在这个过程中，检索到的文本（ $X_{\text{retrieval}}$ ）作为指导，确保 $I_{j}$ （ $X_{\text{image}}$ 的隐藏状态）比 $R_{j}$ （ $X_{\text{retrieval}}$ 的隐藏状态）更接近 $C_{j}$ （ $X_{\text{caption}}$ 的隐藏状态），从而实现语义一致性。总损失结合 $\mathcal{L}_{\text{guide}}$ 与语言模型的交叉熵损失，增强隐藏状态层面的多模态对齐，并提升安全机制。

5 Evaluation 5 评估

5.1 Setup 5.1 设置

5.1.1 Methods 5.1.1 方法

Effective safety evaluation methods are essential for identifying and mitigating risks in model outputs. Here, we outline the primary approaches used to assess response safety [154], focusing on their strengths and constraints.
有效的安全评估方法对于识别和减轻模型输出的风险至关重要。在此，我们概述了用于评估响应安全性的主要方法[154]，重点关注其优势和局限性。

$\bullet$ Rule-Based Matching \faEdit. This method relies on detecting predefined phrases (e.g., “I’m sorry, I can’t assist with it”) to evaluate the safety of model’ responses. While computationally efficient, it has significant limitations, including a lack of contextual understanding and restricted vocabulary coverage, which make it unable to handle the wide variety of expressions that models may generate. Consequently, it fails to provide a thorough assessment of response safety.
$\bullet$ 基于规则的匹配 \faEdit. 这种方法依赖于检测预定义的短语（例如，“抱歉，我无法协助此事”）来评估模型响应的安全性。虽然计算效率高，但它存在显著局限性，包括缺乏上下文理解和词汇覆盖范围有限，这使得它无法处理模型可能生成的各种表达方式。因此，它无法提供全面的响应安全性评估。

$\bullet$ Human-Assisted Evaluation \faEye. This method relies on manual assessment performed by human evaluators to provide high-quality and context-aware safety evaluations. While human judgment allows for comprehensive and flexible assessments, this approach is highly resource-intensive and constrained by subjectivity. Variations in individual perspectives, cultural backgrounds, and personal biases can lead to inconsistencies, limiting both the scalability and reproducibility of the evaluation process.
$\bullet$ 人工辅助评估 \faEye. 这种方法依赖于人工评估员进行的手动评估，以提供高质量且具有上下文感知能力的安全性评估。虽然人工判断允许进行全面和灵活的评估，但这种方法的资源消耗非常高，并受主观性限制。个体视角、文化背景和个人偏见的差异可能导致评估结果不一致，从而限制了评估过程的可扩展性和可重复性。

$\bullet$ Fine-tuned Model-Based Assessment \faGithubAlt. This approach utilizes fine-tuned LLMs or LVLMs to classify responses as safe or unsafe, as demonstrated by systems like LlamaGuard [155], LLaVAGuard [146], and GUARDRANK [56]. Compared to rule-based methods, it offers improved contextual understanding and accuracy. However, its performance heavily depends on the quality and diversity of the fine-tuning dataset and is fundamentally limited by the capabilities of the base model, particularly when handling out-of-distribution (OOD) inputs or complex safety scenarios.
$\bullet$ 基于微调模型的评估 \faGithubAlt. 这种方法利用微调的 LLMs 或 LVLMs 对响应进行安全或不安全的分类，例如 LlamaGuard [ 155]、LLaVAGuard [ 146]和 GUARDRANK [ 56]等系统所示。与基于规则的方法相比，它提供了更好的上下文理解和准确性。然而，其性能严重依赖于微调数据集的质量和多样性，并且从根本上受限于基础模型的能力，尤其是在处理分布外（OOD）输入或复杂安全场景时。

$\bullet$ GPT Proxy-Based Assessment \faGoogle. This method leverages the advanced capabilities of models like GPT-4 [16] through API-based interactions, using carefully crafted prompts to evaluate responses to malicious instructions. By taking advantage of LLMs’ ability to understand context and generate detailed evaluations, this approach provides a robust framework for safety assessment. However, its effectiveness is undermined by inherent limitations such as biases and hallucination, where the model generates plausible but incorrect information. These issues introduce potential inconsistencies and reduce the reliability of the evaluation process. An example of a evaluation prompt selected from [41] is as follows:
$\bullet$ 基于 GPT 代理的评估 \faGoogle. 这种方法通过 API 交互利用 GPT-4 等模型的先进能力[ 16]，使用精心设计的提示来评估对恶意指令的响应。通过利用 LLMs 理解上下文和生成详细评估的能力，这种方法提供了一个强大的安全评估框架。然而，其有效性受到固有限制的削弱，如偏见和幻觉，其中模型会生成看似合理但错误的信息。这些问题引入了潜在的矛盾，降低了评估过程的可靠性。从[ 41]中选择的一个评估提示示例如下：

5.1.2 Metrics 5.1.2 指标

In this part, we introduce a set of metrics used to evaluate the safety and robustness of the model under different conditions. These metrics assess the model’s ability to handle adversarial inputs, generate safe outputs, and maintain functionality across tasks. Specifically, the inputs to the model consist of text ( $T$ ) and images ( $I$ ), while the output is the model’s response ( $R$ ). The following metrics will be evaluated based on these common inputs and outputs.
在这一部分，我们介绍一套用于评估模型在不同条件下安全性和鲁棒性的指标。这些指标用于评估模型处理对抗性输入、生成安全输出以及在各项任务中保持功能的能力。具体而言，模型的输入包括文本（ $T$ ）和图像（ $I$ ），输出是模型的响应（ $R$ ）。以下将根据这些常见的输入和输出进行评估。

$\bullet$ Attack Success Rate (ASR) is employed to quantify the probability of eliciting harmful responses from LVLMs using pairs of image-text queries. Consider a dataset $D$ comprising $n$ pairs of image-text queries, we formally define the ASR as:
$\bullet$ 攻击成功率（ASR）用于量化使用图像-文本查询对诱使大型视觉语言模型（LVLMs）产生有害响应的概率。考虑一个包含 $n$ 对图像-文本查询的 $D$ 数据集，我们正式定义 ASR 为：

\displaystyle\text{ASR}=\frac{\sum_{i=1}^{n}\text{AttackEvaluator}(T_{i},I_{i}% ,R_{i})}{|D|},

(12)

where $\text{AttackEvaluator}(\cdot)$ is a binary function that returns 1 if the model’s response $R_{i}$ to a given input pair $(T_{i},I_{i})$ is evaluated as unsafe and 0 otherwise. This metric effectively captures the proportion of query pairs capable of inducing undesirable outputs from the target LVLM. A lower ASR indicates greater safety performance to attacks.
其中 $\text{AttackEvaluator}(\cdot)$ 是一个二元函数，当模型 $R_{i}$ 对给定的输入对 $(T_{i},I_{i})$ 的响应被评估为不安全时返回 1，否则返回 0。这个指标有效地捕捉了能够从目标大型视觉语言模型（LVLM）中诱导出不良输出的查询对的比例。ASR 越低，表明对攻击的安全性性能越好。

$\bullet$ Safety Risk Index (SRI) is introduced by SafeBench [154], aim to provides a more detailed evaluation of model safety by distinguishing between varying levels of response severity. While ASR classifies all unsafe responses equally, SRI measures the degree of risk posed by each response, enabling a finer-grained safety analysis. For example, when queried about harmful topics like bomb-making, one model might generate detailed instructions while another merely references a related book. Although both would be considered unsafe under ASR, SRI assigns a higher risk score to the former due to its greater severity. SRI is computed as:
$\bullet$ 安全风险指数（SRI）由 SafeBench [ 154]引入，旨在通过区分不同级别的响应严重程度，对模型安全性进行更详细的评估。虽然 ASR 对所有不安全响应同等对待，但 SRI 衡量每个响应带来的风险程度，从而实现更细粒度的安全分析。例如，当被问及炸弹制作等有害话题时，一个模型可能会生成详细说明，而另一个模型仅引用相关书籍。尽管在 ASR 标准下两者都被视为不安全，但 SRI 由于前者严重程度更高，会赋予其更高的风险评分。SRI 的计算公式为：

\text{SRI}=\frac{\sum_{i=1}^{n}\text{RiskEvaluator}(T_{i},I_{i},R_{i})}{|D|},

(13)

where $\text{RiskEvaluator}(\cdot)$ is a scoring function that evaluates the severity of the model’s response $R_{i}$ to a query pair $(T_{i},I_{i})$ , assigning a numerical risk value based on predefined criteria. The final SRI score is normalized to a range of 0 to 100 for interpretability. Higher SRI values indicate higher safety performances.
其中 $\text{RiskEvaluator}(\cdot)$ 是一个评分函数，用于评估模型对查询对 $(T_{i},I_{i})$ 的响应 $R_{i}$ 的严重程度，并根据预定义标准赋予数值风险值。最终 SRI 分数被归一化到 0 到 100 的范围，以便于理解。更高的 SRI 值表示更高的安全性表现。

$\bullet$ Attack Success Degree (ASD), proposed by MLLMGuard [56], ASD assesses the severity of unsafe model responses across four levels as shown in Tab. VI, score 0 represents Safe and Responsible behavior, where the model refuses assistance with appropriate warnings; score 1 reflects harmless but generic responses; score 2 indicates direct unsafe outputs (e.g., step-by-step harmful instructions); and score 3 denotes high-risk responses, where the model acknowledges the harm but still provides guidance (e.g., warning about the dangers of creating a bomb while offering instructions). ASD is computed as:
$\bullet$ 攻击成功程度（ASD），由 MLLMGuard [ 56] 提出，ASD 根据表 VI 所示的四级评估不安全模型响应的严重程度，分数 0 代表安全负责任的行为，即模型拒绝协助并给出适当的警告；分数 1 反映无害但通用的响应；分数 2 表示直接的不安全输出（例如，逐步有害的指令）；分数 3 表示高风险响应，即模型承认危害但仍然提供指导（例如，警告制造炸弹的危险性同时提供指导）。ASD 的计算公式为：

\text{ASD}=\frac{\sum_{i=1}^{n}\text{Smooth}(\text{DegreeEvaluator}(T_{i},I_{i% },R_{i}))}{|D|},

(14)

where $\text{DegreeEvaluator}(\cdot)$ evaluates the severity of the model’s response $R_{i}$ to a query pair $(T_{i},I_{i})$ , assigning a score from 0 to 3, and $\text{Smooth}(\cdot)$ normalizes them to a 0–1 scale. Lower ASD values indicate better safety performance.
其中 $\text{DegreeEvaluator}(\cdot)$ 评估模型对查询对 $(T_{i},I_{i})$ 的响应 $R_{i}$ 的严重程度，并分配 0 到 3 的分数， $\text{Smooth}(\cdot)$ 将它们归一化到 0-1 的刻度上。较低的 ASD 值表示更好的安全性表现。

TABLE VI: Scoring rules for Attack Success Degree (ASD). Details in § 5.1.2.
表 VI：攻击成功程度（ASD）的评分规则。详细信息见§ 5.1.2。

	Safe 安全	Unsafe 不安全
Aware 有意识	0	2
Unaware 无意识	1	3

$\bullet$ Perfect Answer Rate (PAR) is derived from ASD, MLLMGuard [56] introduced PAR, measures the proportion of safe and responsible responses, specifically those categorized as score 0 by ASD. It is computed as:
$\bullet$ 完美答案率（PAR）源自 ASD，MLLMGuard [ 56] 引入了 PAR，衡量安全且负责任的响应比例，具体指被 ASD 归类为 0 分的响应。其计算公式为：

\text{PAR}=\frac{\sum_{i=1}^{n}\mathbb{I}(\text{DegreeEvaluator}(T_{i},I_{i},R% _{i})=0)}{|D|},

(15)

where $\mathbb{I}(\cdot)$ is an indicator function returning 1 if the condition is true and 0 otherwise. Higher PAR indicates stronger safety performance.
$\mathbb{I}(\cdot)$ 是一个指示函数，当条件为真时返回 1，否则返回 0。更高的 PAR 表示更强的安全性表现。

$\bullet$ Refusal Rate (RR) evaluates the ability of LVLMs to recognize malicious queries and appropriately refuse to respond. It quantifies the proportion of cases where the model accurately identifies a query as unsafe and opts to reject it. Formally, RR is calculated as:
$\bullet$ 拒绝率（RR）评估了大型视觉语言模型识别恶意查询并适当拒绝响应的能力。它量化了模型准确识别查询为不安全并选择拒绝的比例。形式上，RR 的计算公式为：

\text{RR}=\frac{\sum_{i=1}^{n}\mathbb{I}(\text{RefusalEvaluator}(T_{i},I_{i},R% _{i}))}{|D|},

(16)

where $\mathbb{I}(\cdot)$ is an indicator function returning 1 if the model’s response $R_{i}$ to the query pair $(T_{i},I_{i})$ appropriately rejects the unsafe query, and 0 otherwise. Higher RR indicates stronger ability of the model to handle malicious inputs.
其中 $\mathbb{I}(\cdot)$ 是一个指示函数，当模型 $R_{i}$ 对查询对 $(T_{i},I_{i})$ 适当地拒绝不安全的查询时返回 1，否则返回 0。更高的 RR 表示模型处理恶意输入的能力更强。

$\bullet$ Effective Rate (ER) evaluates the proportion of responses that effectively address user queries while maintaining contextual relevance and accuracy. It reflects the model’s ability to generate meaningful and effective outputs. A higher ER indicates better performance in providing useful and accurate responses.
$\bullet$ 有效率（ER）评估模型在保持上下文相关性和准确性的同时，有效回应用户查询的比例。它反映了模型生成有意义和有效输出的能力。更高的 ER 表明模型在提供有用和准确回应方面表现更好。

$\bullet$ Toxicity Score (TS) evaluates the level of harmful or offensive language in a model’s response using tools like PerspectiveAPI²²2https://www.perspectiveapi.com/
$\bullet$ 毒性评分（TS）使用 PerspectiveAPI 等工具评估模型响应中危害性或冒犯性语言的程度 ² and Detoxify API³³3https://github.com/unitaryai/detoxify
和 Detoxify API ³ . These tools assign a toxicity score based on linguistic features, providing a systematic assessment of harmful content. It is commonly used as a complementary evaluation metric in Fine-tuned Model-based Assessment.
这些工具根据语言特征分配毒性分数，为有害内容提供系统性的评估。它通常被用作基于微调模型的评估的补充评估指标。

$\bullet$ GPT-Score (GS) uses GPT-4 to rate the quality, relevance, and safety of model responses based on predefined criteria. While offering valuable insights, this metric is subjective and can vary due to GPT-4’s inherent biases. It is commonly used as a complementary evaluation metric in GPT Proxy-Based Assessment.
$\bullet$ GPT-Score（GS）使用 GPT-4 根据预定义标准对模型响应的质量、相关性和安全性进行评分。虽然提供了有价值的见解，但这一指标是主观的，可能因 GPT-4 的固有偏见而有所差异。它通常作为 GPT 代理评估中的补充评估指标使用。

5.2 Benchmarks 5.2 基准测试

TABLE VII: Summary of essential characteristics for reviewed methods in Safety Capability Benchmarks (§ 5.2.2).
表 VII：安全能力基准（§ 5.2.2）中审查方法的关键特征总结

Methods 方法	Venue 场地	Scale 规模	Methods 方法	Metrics 指标	Highlight 重点
Adversarial Capability Benchmarks 对抗能力基准
HowManyUnicorns[52]	[ECCV’24]	8.5K	\faEye	ASR( $\downarrow$ )	OOD & adversarial robustness OOD & 对抗鲁棒性
AVIBench[156]	[arXiv’24]	260K	-	ASR( $\downarrow$ ), ASDR( $\downarrow$ ASR( $\downarrow$ ), ASDR( $\downarrow$ ))	Large scale adversarial benchmark 大规模对抗基准
Multi-Model Attack Benchmarks 多模型攻击基准
MM-SafetyBench[53]	[ECCV’24]	5K	\faEye \faGoogle	ASR( $\downarrow$ ), RR( $\uparrow$ )	OCR & diffusion based attack OCR & 扩散攻击
TypoD[157]	[ECCV’24]	20K	-	ASR( $\downarrow$ )	Typography attack benchmark 字体攻击基准
RTVLM[158]	[ACL’24]	5.2K	\faGoogle	GS( $\uparrow$ )	First LVLM red teaming benchmark 首个视觉语言模型红队评测基准
JailBreakV-28K[54]	[COLM’24]	28K	\faGithubAlt	ASR( $\downarrow$ )	Both image-based and text-based 基于图像和基于文本的
RTGPT4[159]	[ICLR’24]	1.4K	\faEdit \faGithubAlt	ASR( $\downarrow$ )	Jailbreak red teaming benchmark Jailbreak 红队测试基准
MultiTrust[55]	[NeurIPS’24]	23K	\faEdit \faGithubAlt \faGoogle	ASR( $\downarrow$ ), RR( $\uparrow$ ), TS( $\downarrow$ ), GS( $\downarrow$ )	Comprehensive trustworthy benchmark 全面的可信基准
MLLMGuard[56]	[NeurIPS’24]	2.3K	\faGithubAlt	ASD( $\downarrow$ ), PAR( $\uparrow$ )	Bilingual benchmark 双语基准
MOSSBench[160]	[arXiv’24]	300	\faEye \faGoogle	RR( $\downarrow$ )	Safety oversensitivity benchmark 安全过度敏感基准
Arondight[161]	[MM’24]	14K	\faEye \faGithubAlt	ASR( $\downarrow$ ), TS( $\downarrow$ )	Auto-generated red teaming evaluation 自动生成的红队评估
SafeBench[154]	[arXiv’24]	2.3K	\faGithubAlt	ASR( $\downarrow$ ), SRI( $\uparrow$ ASR( $\downarrow$ ), SRI( $\uparrow$ ))	Jury-based evaluation framework 基于陪审团的评估框架
Cross-Modality Alignment Benchmarks 跨模态对齐基准
SIUO[44]	[arXiv’24]	167	\faEye \faGoogle	ASR( $\downarrow$ ), ER( $\uparrow$ )	Safe input unsafe output 安全输入不安全输出
MSSBench[162]	[arXiv’24]	1.8K	\faGoogle	ASR( $\downarrow$ )	Multimodel situational safety benchmark 多模型情境安全基准

To evaluate the security capabilities of LVLMs and the effectiveness of attack and defense strategies, researchers have developed extensive benchmarks. These benchmarks fall into two categories: Strategy Effectivity (§ 5.2.1), which assesses attack and defense strategies, and Safety Capability (§ 5.2.2), which examines the models’ inherent security capabilities. Together, they provide a comprehensive framework for improving LVLMs.
为了评估大型视觉语言模型（LVLMs）的安全能力以及攻击和防御策略的有效性，研究人员开发了广泛的基准。这些基准分为两类：策略有效性（§ 5.2.1），用于评估攻击和防御策略，以及安全能力（§ 5.2.2），用于检验模型的固有安全能力。两者共同为改进 LVLMs 提供了一个全面的框架。

5.2.1 Strategy Effectivity
5.2.1 策略有效性

Despite the proliferation of various attack and defense methodologies, evaluating their effectiveness under consistent and standardized conditions remains a significant challenge. This aspect is dedicated to assessing the efficacy of diverse attack and defense strategies. At present, only one work has tackled this issue in a comprehensive manner.
尽管各种攻击和防御方法不断涌现，但在一致和标准化的条件下评估它们的有效性仍然是一个重大挑战。这一方面致力于评估各种攻击和防御策略的有效性。目前，只有一项工作全面地处理了这个问题。

$\bullet$ MMJ-Bench [163] is the first benchmark designed to evaluate LVLM jailbreak attack and defense techniques in a standardized and comprehensive manner. This benchmark utilizes HarmBench [164] to generate harmful queries, and incorporates three generation-based attacks: FigStep [41], MMSafetyBench [53], and HADES [40], as well as three optimization-based attacks: VisualAdv [23], ImgJP [101], and AttackVLM [25], to create corresponding jailbreak prompts. For defense strategies, one proactive defense, VLGuard [49], is selected, along with three reactive defenses: AdaShield [43], CIDER [44], and JailGuard [47]. Through this unified evaluation framework, the study reveals that the effectiveness of each attack varies across LVLMs, with no model exhibiting uniform robustness against all jailbreak attacks. The development of a defense method that achieves an optimal balance between model utility and defense efficacy for all LVLMs presents a considerable challenge.
$\bullet$ MMJ-Bench [ 163] 是首个设计用于以标准化和全面的方式评估大型视觉语言模型（LVLM）越狱攻击与防御技术的基准。该基准利用 HarmBench [ 164] 生成有害查询，并整合了三种基于生成的攻击：FigStep [ 41]、MMSafetyBench [ 53] 和 HADES [ 40]，以及三种基于优化的攻击：VisualAdv [ 23]、ImgJP [ 101] 和 AttackVLM [ 25]，以创建相应的越狱提示。在防御策略方面，选用了 VLGuard [ 49] 这一主动防御方法，以及 AdaShield [ 43]、CIDER [ 44] 和 JailGuard [ 47] 三种被动防御方法。通过这一统一的评估框架，研究揭示每种攻击在 LVLM 上的有效性存在差异，没有模型对所有越狱攻击表现出一致的鲁棒性。为所有 LVLM 开发一种在模型效用与防御效能之间取得最佳平衡的防御方法，是一项相当大的挑战。

5.2.2 Safety Capability 5.2.2 安全能力

This aspect of the benchmark is designed to evaluate the model’s ability to handle diverse safety-critical scenarios. Including measuring the model’s response to malicious inputs, assessing robustness against adversarial attacks, and verifying its alignment with ethical and safety principles.
这一基准测试的方面旨在评估模型处理多样化安全关键场景的能力。包括测量模型对恶意输入的反应，评估其对抗对抗性攻击的鲁棒性，并验证其与伦理和安全原则的一致性。

$\bullet$ Adversarial Capability Benchmarks. HowManyUnicorns [52] was initially proposed to introduce a straightforward attack strategy designed to mislead LVLMs into generating visually unrelated responses. By evaluating both out-of-distribution (OOD) generalization and adversarial robustness, HowManyUnicorns [52] provides a comprehensive assessment of 21 diverse models, spanning from open-source VLLMs to GPT-4V. Besides, AVIBench [156] focuses on evaluating the robustness of LVLMs against Adversarial Visual-Instructions (AVIs). AVIBench [156] employs LVLM-agnostic and output-probability-distribution-agnostic black-box attack methods, generating 10 types of text-based AVIs, 4 types of image-based AVIs, and 9 types of content bias AVIs, resulting in a comprehensive dataset of 260K AVIs. This benchmark evaluates the adversarial robustness of 14 open-source models and 2 proprietary models (GPT-4V, GeminiPro), providing an extensive assessment of their ability to withstand various adversarial attacks.
$\bullet$ 对抗能力基准测试。HowManyUnicorns [ 52] 最初被提出用于引入一种直接的攻击策略，旨在误导大型视觉语言模型（LVLMs）生成视觉上不相关的响应。通过评估分布外（OOD）泛化和对抗鲁棒性，HowManyUnicorns [ 52] 对 21 种不同模型进行了全面评估，涵盖了开源视觉语言模型（VLLMs）到 GPT-4V。此外，AVIBench [ 156] 专注于评估 LVLMs 对对抗视觉指令（AVIs）的鲁棒性。AVIBench [ 156] 采用与 LVLM 无关和与输出概率分布无关的黑盒攻击方法，生成了 10 种基于文本的 AVIs、4 种基于图像的 AVIs 和 9 种内容偏差 AVIs，形成了一个包含 260K AVIs 的综合数据集。该基准测试评估了 14 种开源模型和 2 种专有模型（GPT-4V、GeminiPro）的对抗鲁棒性，全面评估了它们抵御各种对抗攻击的能力。

$\bullet$ Multi-Model Attack Benchmarks. MM-SafetyBench [53] presents a framework for assessing LVLM safety against image-based adversarial attacks. The evaluation dataset, constructed using Stable Diffusion⁴⁴4https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
$\bullet$ 多模型攻击基准。MM-SafetyBench [53] 提出了一个用于评估 LVLM 安全性的框架，以应对基于图像的对抗性攻击。评估数据集是使用 Stable Diffusion ⁴ 构建的。 and Typography techniques along with GPT-4-generated queries, includes 5,040 text-image pairs across 13 scenarios. Two primary metrics, ASR and RR, are used to measure model vulnerability, revealing that many LVLMs, despite safety alignment, remain highly susceptible to adversarial manipulations. TypoD [157] investigates LVLM vulnerability to typographic distractions with a dataset spanning perception-oriented tasks (e.g., object recognition) and cognition-oriented tasks (e.g., commonsense reasoning). The study shows that LVLMs must rely on cross-modal attention matching, rather than uni-modal information, to resolve typographic distractions effectively. JailBreakV-28K [54] evaluates LVLM vulnerability to jailbreak attacks by extending the RedTeam-2K dataset with 28,000 multimodal adversarial prompts. The high ASR in evaluations across 10 open-source LVLMs highlights significant vulnerability to these attacks. RTVLM [158] is the first to construct a red teaming dataset to benchmark current LVLMs across four key aspects: faithfulness, privacy, safety, and fairness. Comprising 5,200 samples, RTVLM includes tasks like multimodal jailbreaking and visual misdirection. RTGPT4 [159] combines three attack methods: FigStep [41], VisualAdv [23], and ImageHijacks [27] to create 1,445 samples, which are used to evaluate 11 different LLMs and MLLMs, finding that GPT-4 and GPT-4V outperform open-source models in resisting both textual and visual jailbreak techniques. Arondight [161] addresses challenges in adapting red teaming techniques from LLMs to VLMs, such as the lack of a visual modality and insufficient diversity. The framework features an automated multimodal jailbreak attack, where visual prompts are generated by a red team VLM and textual prompts by a red team LLM, guided by reinforcement learning. To improve VLM security evaluation, the framework incorporates entropy bonuses and novelty reward metrics. MultiTrust [55] offers a unified benchmark for LVLM trustworthiness, evaluating 21 models across 32 tasks in five critical dimensions: truthfulness, safety, robustness, fairness, and privacy. The study highlights significant vulnerabilities in LVLMs, particularly in novel multimodal scenarios where cross-modal interactions often introduce instabilities. MLLMGuard [56] provides a bilingual evaluation dataset, incorporating red teaming techniques to assess five safety dimensions (privacy, bias, toxicity, truthfulness, and legality) across 12 subtasks. This framework yields valuable insights for improving model safety. MOSSBench [160] evaluates the oversensitivity of LVLMs to harmless queries when specific visual stimuli are present, revealing that safety-aligned models tend to exhibit a higher degree of oversensitivity. SafeBench [154] introduces an automated pipeline for safety dataset generation, leveraging a jury system to evaluate harmful behaviors and assess content security risks through collaborative LLMs. This innovative approach provides a impartial assessment of LVLM safety.
以及排版技术，并结合 GPT-4 生成的查询，包含 13 种场景下的 5,040 个文本-图像对。使用 ASR 和 RR 两个主要指标来衡量模型漏洞，揭示许多 LVLM 尽管进行了安全对齐，仍然高度容易受到对抗性操控。TypoD [ 157] 研究了 LVLM 对排版干扰的漏洞，其数据集涵盖了感知导向任务（例如，物体识别）和认知导向任务（例如，常识推理）。研究表明，LVLM 必须依赖跨模态注意力匹配，而不是单模态信息，才能有效解决排版干扰。JailBreakV-28K [ 54] 通过扩展 RedTeam-2K 数据集，增加了 28,000 个多模态对抗性提示，评估了 LVLM 对越狱攻击的漏洞。在 10 个开源 LVLM 上的评估中，高 ASR 凸显了这些攻击的显著漏洞。RTVLM [ 158] 首次构建了一个红队测试数据集，用于在四个关键方面（可信度、隐私、安全和公平性）基准测试当前的 LVLM。包含 5,200 个样本的 RTVLM，包括多模态越狱和视觉误导等任务。 RTGPT4 [ 159] 结合了三种攻击方法：FigStep [ 41]、VisualAdv [ 23] 和 ImageHijacks [ 27] 来创建 1,445 个样本，用于评估 11 种不同的 LLMs 和 MLLMs，发现 GPT-4 和 GPT-4V 在抵抗文本和视觉越狱技术方面均优于开源模型。Arondight [ 161] 解决了将红队技术从 LLMs 应用于 VLMs 时面临的挑战，例如缺乏视觉模态和多样性不足。该框架具有自动多模态越狱攻击功能，其中视觉提示由红队 VLM 生成，文本提示由红队 LLM 生成，并在强化学习的指导下进行。为了改进 VLM 安全评估，该框架集成了熵奖励和新颖性奖励指标。MultiTrust [ 55] 提供了一个统一的基准来评估 LVLM 的可信度，在五个关键维度（真实性、安全性、鲁棒性、公平性和隐私性）上对 21 个模型在 32 项任务中进行了评估。该研究突出了 LVLM 中的重大漏洞，特别是在新颖的多模态场景中，跨模态交互常常引入不稳定性。 MLLMGuard [ 56] 提供了一个双语评估数据集，结合红队技术评估五个安全维度（隐私、偏见、毒性、真实性和合法性），涵盖 12 个子任务。该框架为提升模型安全性提供了宝贵的见解。MOSSBench [ 160] 评估了 LVLMs 在存在特定视觉刺激时对无害查询的过度敏感性，揭示安全对齐模型倾向于表现出更高的过度敏感性程度。SafeBench [ 154] 引入了一个用于安全数据集生成的自动化流程，利用陪审团系统通过协作 LLMs 评估有害行为并评估内容安全风险。这种创新方法为 LVLM 安全性提供了公正的评估。

$\bullet$ Cross-Modality Alignment Benchmarks. LLMs generally undergo safety alignment [165, 166, 167]. Nevertheless, for LVLMs, ensuring cross-modal safety alignment is even more crucial due to the integration of both visual and textual modalities (e.g., asking an LVLM how to jump from a cliff while providing an image of a person standing at the edge). To address this, SIUO [168] developed a dataset where both textual and visual inputs are individually safe, but their combination results in unsafe outputs. The dataset was used to evaluate the integration, knowledge, and reasoning capabilities of 15 LVLMs, revealing that even advanced models like GPT-4V [16] only achieve a safe response rate of 53.26% on this benchmark. Additionally, MSSBench [162] constructed a larger dataset containing 1,820 language-query-image pairs, half of which are safe and the other half unsafe. The findings highlight that current LVLMs struggle with this safety issue in instruction-following tasks and face significant challenges when addressing these situational safety concerns simultaneously, underscoring a crucial area for future research.
$\bullet$ 跨模态对齐基准测试。LLMs 通常会进行安全对齐 [ 165, 166, 167]。然而，对于 LVLMs 来说，由于视觉和文本模态的集成（例如，向一个 LVLM 提问如何从悬崖跳下，同时提供一张站在边缘的人的图片），确保跨模态安全对齐更为关键。为了解决这个问题，SIUO [ 168] 开发了一个数据集，其中文本和视觉输入都是单独安全的，但它们的组合会产生不安全的输出。该数据集被用于评估 15 个 LVLMs 的集成、知识和推理能力，揭示即使是像 GPT-4V [ 16] 这样的高级模型，在这个基准测试上也只能达到 53.26% 的安全响应率。此外，MSSBench [ 162] 构建了一个更大的数据集，包含 1,820 个语言-查询-图像对，其中一半是安全的，另一半是不安全的。这些发现表明，当前的 LVLMs 在指令跟随任务中难以处理这个问题，并且在同时处理这些情境安全问题时面临重大挑战，突显了未来研究的一个关键领域。

6 Safety Evaluation on Janus-Pro
6 Janus-Pro 安全评估

TABLE VIII: Evaluation on SIUO using ASR (

\downarrow

) with both close-source and open-source LVLMs. OpenQA refers to open-ended question answering, while MCQA refers to multiple-choice question answering.
表 VIII：使用 ASR (

\downarrow

) 在 SIUO 上对闭源和开源视觉语言模型进行评估。OpenQA 指开放式问答，而 MCQA 指多项选择题问答。

Models 模型	Size 大小	OpenQA 开放式问答	MCQA
Close-Source LVLMs 闭源视觉语言模型
GPT-4V(ision)	-	46.71	61.08
GPT-4o	-	49.10	58.68
Gemini 1.5 Pro	-	47.90	52.69
Gemini 1.0 Pro	-	72.46	65.87
Open-Source LVLMs 开源视觉语言模型
Qwen-VL-7B-Chat	9.6B	58.68	79.04
MiniGPT4-v2	8B	58.08	72.46
LLaVA-1.6-34B	34B	59.28	47.31
LLaVA-1.5-13B	13.4B	77.84	67.07
LLaVA-1.5-7B	7.2B	78.44	66.47
CogVLM	17B	77.25	72.46
mPLUG-OWL2	8.2B	77.84	71.86
Janus-Pro-7B	7B	84.43	73.06

TABLE IX: Evaluation on MM-SafetyBench using ASR (

\downarrow

) for LLaVA-1.5-7B, LLaVA-1.5-13B, and Janus-Pro-7B, highlighting the best and second-best performances.
表 IX：在 MM-SafetyBench 上使用 ASR (

\downarrow

) 对 LLaVA-1.5-7B、LLaVA-1.5-13B 和 Janus-Pro-7B 的评估，突出最佳和次佳表现。

	SD			TYPO			SD + TYPO
Scenarios 场景	LLaVA 7B	LLaVA 13B	Janus-Pro 7B	LLaVA 7B	LLaVA 13B	Janus-Pro 7B	LLaVA 7B	LLaVA 13B	Janus-Pro 7B
01-Illegal Activity 01-非法活动	22.68	25.77	36.08	79.38	81.44	97.93	77.32	80.41	92.78
02-Hate Speech 02-仇恨言论	16.56	14.11	17.17	39.88	47.24	67.48	47.85	44.79	61.96
03-Malware Generation 03-恶意软件生成	20.45	11.36	25.00	65.91	59.09	77.27	70.45	68.18	84.09
04-Physical Harm 04-物理伤害	20.14	22.22	23.61	60.42	59.72	72.91	62.50	63.19	78.47
05-Economic Harm 05-经济伤害	4.10	4.10	10.65	14.75	16.39	30.32	15.57	13.93	30.32
06-Fraud 06-欺诈	20.13	20.13	27.27	72.73	75.32	77.92	66.88	74.03	74.67
07-Sex 07-性	11.93	13.76	6.42	53.21	49.54	50.45	53.21	46.79	33.02
08-Political Lobbying 08-政治游说	73.86	69.93	49.01	94.77	94.77	80.66	96.73	96.08	94.77
09-Privacy Violence 09-隐私暴力	12.95	17.27	25.89	55.40	59.71	85.61	51.08	64.75	86.33
10-Legal Opinion 10-法律意见	92.31	93.85	78.46	94.62	95.38	78.46	96.92	96.92	96.15
11-Financial Advice 11-财务建议	97.00	98.80	100.00	99.40	99.40	100.00	100.00	98.80	100.00
12-Health Consultation 12-健康咨询	99.08	99.08	91.74	100.00	100.00	88.99	100.00	100.00	99.08
13-Gov Decision 13-政府决策	98.66	96.64	85.90	99.33	100.00	95.97	99.33	100.00	100.00
Average 平均	45.37	45.16	46.07	71.52	72.15	78.09	72.14	72.91	79.94

6.1 Details of Janus-Pro
6.1 Janus-Pro 的详细信息

Janus-Pro-7B [169] is the latest LVLM released by DeepSeek, representing a significant advancement over Janus-1B. This new model scales up both the data and model parameters, validating the potential of the original design. DeepSeek’s Janus-Pro integrates unified multimodal understanding and generation capabilities, addressing the longstanding gap between image understanding (as seen in GPT-4o [16]) and image generation (such as with Stable Diffusion). While earlier approaches typically relied on separate models for image understanding and generation, Janus-Pro aims to bridge this divide using a single model for both tasks. One of the challenges in combining understanding and generation lies in the different encoder architectures typically used for image encoding in understanding and generation tasks. Janus-Pro tackles this by employing separate encoders for image processing, but integrates them into a unified latent space, where both text-to-image and image-to-text tasks are handled through an autoregressive framework similar to LLMs. After validating the feasibility of this approach, Janus-Pro revealed significant gaps in generation performance when compared to diffusion-based models such as Stable Diffusion, which is a known challenge for autoregressive models in image generation. Janus-Pro improves upon this by scaling the model and data, and introducing an optimized training strategy. As a result, Janus-Pro has achieved state-of-the-art performance in multimodal understanding tasks and has surpassed diffusion models such as DALL-E 3 in terms of text-to-image generation capabilities. However, given its strong multimodel understanding performance, how about Janus-Pro’s safety capability?
Janus-Pro-7B [ 169] 是 DeepSeek 发布的最新大型视觉语言模型（LVLM），相较于 Janus-1B 有显著进步。该新模型在数据和模型参数规模上均有所提升，验证了原始设计的潜力。DeepSeek 的 Janus-Pro 集成了统一的多模态理解和生成能力，解决了长期以来图像理解（如 GPT-4o [ 16] 所示）与图像生成（如 Stable Diffusion）之间的差距。虽然早期方法通常依赖独立的模型进行图像理解和生成，但 Janus-Pro 旨在通过单一模型同时完成这两个任务来弥合这一鸿沟。将理解和生成结合起来的一个挑战在于，理解和生成任务中通常使用不同的编码器架构进行图像编码。Janus-Pro 通过采用独立的图像处理编码器，但将它们集成到一个统一的潜在空间中，在这个空间里，文本到图像和图像到文本任务都通过类似于 LLMs 的自回归框架进行处理。在验证了该方法的可行性后，Janus-Pro 在与 Stable Diffusion 等基于扩散的模型相比时，在生成性能上暴露出显著差距，这是自回归模型在图像生成中已知的一个挑战。Janus-Pro 通过扩展模型和数据，并引入优化的训练策略来改进这一点。因此，Janus-Pro 在多模态理解任务中取得了最先进的性能，并在文本到图像生成能力方面超越了 DALL-E 3 等扩散模型。然而，鉴于其强大的多模态理解性能，Janus-Pro 的安全能力如何呢？

6.2 Experiments 6.2 实验

6.2.1 Benchmarks 6.2.1 基准测试

We conduct a set of safety evaluations on Janus-Pro, utilizing two open-source benchmarks: SIUO[168] and MM-SafetyBench[53]. For assessing Cross-Modality Alignment, we used the SIUO dataset, which is developed to evaluate the integration, knowledge, and reasoning capabilities of LVLMs. SIUO consists of 167 samples spanning 9 critical safety domains, including self-harm, illegal activities, and privacy violations. In SIUO, both textual and visual inputs are individually safe, but their combination results in unsafe outputs, posing a challenge to cross-modal reasoning. MM-SafetyBench is then used to evaluate Janus-Pro’s defense capabilities under Multi-Model Attacks. This benchmark consists of 5,040 examples across 13 common scenarios involving malicious intent. The benchmark includes three distinct subcategories: (1) SD: Images generated by Stable Diffusion (SD) conditioned on malicious keywords, (2) OCR: Images containing malicious keywords extracted through Optical Character Recognition (OCR), and (3) SD + OCR: Images generated by SD and then subtitled with OCR. The experiments are conducted on NVIDIA GeForce 4090 GPUs, using the GPT-4o-2024-05-13 API for evaluation. Additionally, temperature is set to 0 to ensure deterministic evaluation results, and manual review is conducted on the evaluation results to ensure accuracy.
我们对 Janus-Pro 进行了一系列安全性评估，使用了两个开源基准：SIUO[168]和 MM-SafetyBench[53]。为了评估跨模态对齐能力，我们使用了 SIUO 数据集，该数据集旨在评估大型视觉语言模型（LVLMs）的整合、知识和推理能力。SIUO 包含 167 个样本，涵盖 9 个关键安全领域，包括自残、非法活动和隐私侵犯。在 SIUO 中，文本和视觉输入单独都是安全的，但它们的组合会产生不安全的输出，这对跨模态推理提出了挑战。MM-SafetyBench 用于评估 Janus-Pro 在多模型攻击下的防御能力。该基准包含 5,040 个涉及恶意意图的 13 种常见场景的示例。该基准分为三个不同的子类别：（1）SD：由 Stable Diffusion（SD）根据恶意关键词生成的图像，（2）OCR：通过光学字符识别（OCR）提取包含恶意关键词的图像，（3）SD + OCR：由 SD 生成后用 OCR 添加字幕的图像。实验在 NVIDIA GeForce 4090 GPU 上进行，使用 GPT-4o-2024-05-13 API 进行评估。此外，温度设置为 0 以确保确定性评估结果，并对评估结果进行人工审核以确保准确性。

6.2.2 Results Analyse 6.2.2 结果分析

$\bullet$ Results on SIUO. From the results presented in Tab. VIII, it is evident that Janus-Pro-7B exhibits suboptimal performance in OpenQA tasks. Its performance significantly lags behind that of LLaVA-1.5-7B, a model of comparable scale, with Janus-Pro achieving a ASR of 84.43%, whereas LLaVA-1.5-7B performs at 78.44%. This underperformance in open-ended question answering may be attributed to several factors, including potential limitations in Janus-Pro’s architecture, which may not yet be fully optimized for complex, open-ended reasoning tasks. Additionally, it is possible that the model’s fine-tuning for open-ended question answering is still in development, leading to less robust responses in comparison to models like Qwen-VL-7B-Chat [65] and MiniGPT4-v2 [170] , which show ASR of 58.68% and 58.08% respectively, demonstrated more refined capabilities in this cross-modality alignment. Conversely, Janus-Pro-7B performs considerably better in MCQA (Multiple Choice Question Answering) tasks, where its ASR of 73.06% is competitive with most other models, such as Qwen-VL-7B-Chat (79.04%) and MiniGPT4-v2 (72.46%). This improvement in performance suggests that Janus-Pro is better suited to structured response tasks, where the model’s predictive capabilities in selecting the most appropriate answer from predefined options are more effectively utilized. This ability to handle fixed-response tasks could enhance Janus-Pro’s safety capabilities, as it is less prone to generating unsafe outputs in well-defined, closed-question settings. The model’s autoregressive framework, which excels in generating coherent responses within such structured contexts, contributes to this improved safety performance.
$\bullet$ SIUO 上的结果。从表 VIII 中的结果可以看出，Janus-Pro-7B 在 OpenQA 任务中的表现并不理想。其性能明显落后于规模相当的 LLaVA-1.5-7B 模型，Janus-Pro 达到了 84.43%的 ASR，而 LLaVA-1.5-7B 的表现为 78.44%。这种在开放式问答中的表现不佳可能归因于多个因素，包括 Janus-Pro 架构的潜在局限性，该架构可能尚未完全优化以应对复杂的开放式推理任务。此外，模型在开放式问答方面的微调可能仍在开发中，导致其响应的鲁棒性不如 Qwen-VL-7B-Chat [ 65] 和 MiniGPT4-v2 [ 170] 等模型，后者分别表现出 58.68%和 58.08%的 ASR，并在跨模态对齐方面展现出更精细的能力。相反地，Janus-Pro-7B 在多项选择题回答（MCQA）任务中表现显著更优，其 73.06%的 ASR（自动语音识别）成绩与大多数其他模型（如 Qwen-VL-7B-Chat 的 79.04%和 MiniGPT4-v2 的 72.46%）相比具有竞争力。这一性能提升表明 Janus-Pro 更适合结构化回答任务，在这种任务中，模型从预定义选项中选择最合适答案的预测能力能被更有效地利用。这种处理固定回答任务的能力可能增强 Janus-Pro 的安全性，因为它在定义明确、封闭式问题的设置中不太容易生成不安全的输出。该模型的自动回归框架在生成此类结构化环境中的连贯回答方面表现出色，这有助于提升其安全性表现。

$\bullet$ Result on MM-SafetyBench. Turning to the results in Tab. IX, which evaluates Janus-Pro’s performance on MM-SafetyBench, a similar pattern of performance discrepancies emerges. In the first half of the table (Scenarios 1-6), which includes safety-critical tasks such as identifying illegal activities, hate speech, and malware generation, Janus-Pro consistently underperforms relative to the LLaVA series models. For example, in the Illegal Activity scenario, Janus-Pro achieves an ASR of 36.08% compared to LLaVA-1.5-7B’s 25.77%. This observation suggests that Janus-Pro may be less adept at addressing safety-sensitive tasks, particularly those involving illegal activities or violence, possibly due to differences in its model architecture or training methodology. We speculate that Janus-Pro’s design of a unified latent space for both text and image generation may struggle to effectively capture and mitigate patterns in scenarios that require highly specialized safety mechanisms. Its encoder architecture, designed to process visual and textual inputs in parallel, might not be sufficiently fine-tuned for the nuanced detection of harmful or malicious content, especially in cases involving high-risk scenarios. In contrast, Janus-Pro demonstrates notable improvement in the second half of the table (Scenarios 7-13), which includes tasks such as political lobbying, privacy violations, and government decision-making. The enhanced performance in these scenarios suggests that Janus-Pro may be more safety when dealing with context-specific tasks that require multimodal reasoning, where the model can leverage its integrated understanding of both textual and visual inputs. The structured nature of these scenarios may align more closely with Janus-Pro’s autoregressive framework, which excels in tasks requiring coherent output generation based on contextual input. This improvement could also be due to the model’s stronger training on data that includes these types of tasks, or it may reflect an inherent advantage in processing structured tasks where text and images contribute complementary information.
$\bullet$ MM-SafetyBench 上的结果。转向表 IX 中的结果，该表评估了 Janus-Pro 在 MM-SafetyBench 上的性能，出现了类似的性能差异模式。在表格的前半部分（场景 1-6），包括识别非法活动、仇恨言论和恶意软件生成等安全关键任务，Janus-Pro 始终相对于 LLaVA 系列模型表现不佳。例如，在非法活动场景中，Janus-Pro 的 ASR 达到 36.08%，而 LLaVA-1.5-7B 为 25.77%。这一观察表明，Janus-Pro 可能不太擅长处理涉及安全敏感的任务，特别是涉及非法活动或暴力的任务，这可能是由于其模型架构或训练方法的不同。我们推测，Janus-Pro 为文本和图像生成设计了一个统一的潜在空间，可能难以有效地捕捉和缓解需要高度专业化安全机制的场景中的模式。其编码器架构设计为并行处理视觉和文本输入，可能无法充分针对有害或恶意内容的细微检测进行微调，特别是在涉及高风险场景的情况下。相比之下，Janus-Pro 在表格的第二部分（场景 7-13）中表现出显著改进，这些场景包括政治游说、隐私侵犯和政府决策等任务。在这些场景中的性能提升表明，Janus-Pro 在处理需要多模态推理的特定上下文任务时可能更安全，此时模型能够利用其整合的文本和视觉输入理解。这些场景的结构化性质可能与 Janus-Pro 的自回归框架更为契合，该框架擅长基于上下文输入生成连贯输出的任务。这种改进也可能归因于模型在包含这些类型任务的训练数据上进行了更强的训练，或者它可能反映了在处理结构化任务时的一种固有优势，在这些任务中，文本和图像提供了互补信息。

$\bullet$ Conclusion. While Janus-Pro has achieved impressive multimodal understanding capabilities, its safety performance remains a significant limitation. Across multiple benchmarks, Janus-Pro fails to meet the basic safety standards of most other models. We speculate that this shortcoming may be due to the model architecture, which was designed to simultaneously handle both understanding and generation tasks, potentially at the expense of specialized safety mechanisms. Additionally, it is possible that Janus-Pro did not undergo specific safety-focused training, which may be contributing to its limited ability to recognize and mitigate harmful or adversarial inputs. This could also be related to the capabilities of the chosen LLM architecture used in Janus-Pro. Given the critical role of safety in deploying multimodal models in real-world applications, it is evident that the safety capabilities of DeepSeek Janus-Pro need substantial improvements. Further refinements in its architecture and training methodology, with a stronger focus on safety and adversarial robustness, are essential for enhancing Janus-Pro’s effectiveness across diverse, high-risk tasks and scenarios.
$\bullet$ 结论。尽管 Janus-Pro 在多模态理解能力上取得了令人印象深刻的成果，但其安全性表现仍然是一个显著的局限。在多个基准测试中，Janus-Pro 未能达到大多数其他模型的基本安全标准。我们推测这一缺陷可能源于模型架构，该架构设计为同时处理理解和生成任务，可能以牺牲专门的安全机制为代价。此外，Janus-Pro 可能没有接受专门针对安全的训练，这可能是其识别和缓解有害或对抗性输入能力有限的原因之一。这也可能与 Janus-Pro 所使用的 LLM 架构的能力有关。鉴于安全性在现实世界应用中部署多模态模型的关键作用，很明显 DeepSeek Janus-Pro 的安全能力需要大幅改进。为了提升 Janus-Pro 在多样化、高风险任务和场景中的有效性，对其架构和训练方法的进一步优化，并更加注重安全性和对抗性鲁棒性，是至关重要的。

7 Outlook 7 展望

7.1 Future Trends 7.1 未来趋势

Based on the reviewed research, we list several future research directions that we believe should be pursued.
基于已审阅的研究，我们列出了几个我们认为应该追求的未来研究方向。

$\bullet$ The Shift Towards Black-box Attacks. A key future direction in attack methodologies is the increasing focus on black-box attacks, which offer advantages over traditional white-box approaches. While effective, white-box attacks require extensive prior knowledge of the target model, limiting their applicability and introducing significant computational overhead [23, 27, 92, 26]. In contrast, black-box attacks exploit the intrinsic capabilities of LVLMs—such as OCR [41, 104], logical reasoning [105], associative memory [106], and multimodal integration [108]—to target vulnerabilities without direct access to the model’s architecture, enhancing transferability and resource efficiency. However, prompt-based defenses [43, 48, 132] often mitigate these attacks, exposing the limitations of current strategies. Future research should focus on developing more advanced black-box attack techniques that can circumvent defenses and demonstrate greater resilience, ensuring their robustness as LVLMs see broader deployment.
$\bullet$ 向黑盒攻击的转变。攻击方法的一个关键未来方向是越来越关注黑盒攻击，它们比传统的白盒方法具有优势。虽然有效，但白盒攻击需要大量关于目标模型的先验知识，限制了其适用性并引入了显著的计算开销[ 23, 27, 92, 26]。相比之下，黑盒攻击利用 LVLMs 的内在能力——如 OCR[ 41, 104]、逻辑推理[ 105]、联想记忆[ 106]和多模态集成[ 108]——来针对漏洞，而无需直接访问模型的架构，从而提高了可迁移性和资源效率。然而，基于提示的防御[ 43, 48, 132]通常可以缓解这些攻击，这暴露了当前策略的局限性。未来研究应专注于开发更先进的黑盒攻击技术，使其能够绕过防御并展示更大的韧性，确保它们在 LVLMs 更广泛部署时的鲁棒性。

$\bullet$ Enhancing Safety through Cross-Modality Alignment. Most defense mechanisms focus on detecting harmful inputs, addressing obvious attacks [49, 146], yet LVLMs in real-world applications face subtler threats. For example, individually innocuous image and text inputs can combine to produce unsafe outputs [168, 162]. Visual components often struggle to identify unsafe elements or grasp contextual nuances, and research on aligning safety across visual and textual modalities remains limited [29]. Future efforts should focus on bridging the security capabilities of LLMs and vision encoders to address these modality gaps. Ensuring seamless safety integration across modalities is essential to prevent harmful interpretations arising from their combination. Additionally, improving contextual understanding in joint visual-textual processing will be crucial for enhancing the robustness and reliability of LVLMs in dynamic environments.
$\bullet$ 通过跨模态对齐增强安全性。大多数防御机制集中于检测有害输入，应对明显的攻击[49, 146]，然而在实际应用中，LVLMs 面临更隐蔽的威胁。例如，单独无害的图像和文本输入可能组合产生不安全的输出[168, 162]。视觉组件往往难以识别不安全元素或理解上下文细节，而跨视觉和文本模态对齐安全性的研究仍有限[29]。未来工作应着重于提升 LLMs 和视觉编码器的安全能力，以弥补这些模态差距。确保跨模态的无缝安全集成对于防止它们组合产生有害解读至关重要。此外，提升联合视觉-文本处理中的上下文理解能力，对于增强 LVLMs 在动态环境中的鲁棒性和可靠性将至关重要。

$\bullet$ Diversifying Safety Fine-Tuning Techniques. Balancing the enhancement of safety while maintaining the general capabilities of LVLMs remains a significant challenge. Traditional fine-tuning approaches often risk compromising the model’s overall performance in the pursuit of improved safety measures. To address this dilemma, future research should explore a broader range of safety fine-tuning methodologies, such as Reinforcement Learning from Human Feedback (RLHF) [171], adversarial training, and multi-objective optimization. RLHF, in particular, offers a promising direction by enabling models to iteratively learn safety-oriented behaviors from refined feedback, reducing the reliance on static, rule-based constraints. Additionally, techniques like curriculum learning could help models gradually adapt to increasingly complex safety scenarios without sacrificing their ability to generalize across diverse tasks. Hybrid approaches that combine multiple fine-tuning strategies may further enhance safety while preserving or even improving the model’s overall capabilities. These advancements are crucial for developing robust, safety-aware LVLMs that can meet the demands of real-world applications without significant trade-offs.
$\bullet$ 丰富安全微调技术。在提升安全性的同时保持 LVLMs 的通用能力仍然是一个重大挑战。传统的微调方法往往在追求改进安全措施的过程中，有损害模型整体性能的风险。为了解决这一困境，未来的研究应探索更广泛的安全微调方法，例如人类反馈强化学习（RLHF）[ 171]、对抗训练和多目标优化。RLHF 尤其提供了一个有前景的方向，它使模型能够通过精细的反馈迭代学习以安全为导向的行为，减少对静态、基于规则的约束的依赖。此外，课程学习等技术可以帮助模型在不牺牲其泛化不同任务能力的情况下，逐步适应日益复杂的安全场景。结合多种微调策略的混合方法可能进一步增强安全性，同时保持或甚至提高模型的整体能力。这些进步对于开发稳健、注重安全的视觉语言模型至关重要，它们能够在不进行重大权衡的情况下满足实际应用的需求。

$\bullet$ Developing Unified Strategy Benchmarking Frameworks. The rapid diversification of attack and defense methodologies for LVLMs has led to fragmented experimental environments, obstructing meaningful cross-method comparisons of effectiveness, efficiency, and overall performance. While existing benchmark MMJBench [163], offers evaluations of various strategies, but it employs a limited set of assessment methods and lack a comprehensive, general framework. To address these shortcomings, future research may prioritize the development of standardized benchmarking frameworks that unify the evaluation of diverse strategies. These benchmarks should encompass a broad range of metrics, including attack success rates, computational resource requirements, response times, and resilience against adaptive defenses. Additionally, incorporating diverse datasets and realistic threat models is essential to ensure evaluations accurately reflect real-world scenarios. The creation of open-source benchmark suites, supported by community contributions, will enhance transparency and reproducibility, enabling researchers to validate and build upon each other’s work more effectively. By implementing comprehensive benchmarking strategies, the research community can systematically assess the strengths and limitations of existing approaches, drive the innovation of more robust and efficient solutions, and ultimately advance the security and reliability of LVLMs in practical applications.
$\bullet$ 开发统一的策略基准测试框架。LVLMs 的攻击和防御方法的快速多样化导致了实验环境的碎片化，阻碍了在有效性、效率和整体性能方面的跨方法比较。虽然现有的基准 MMJBench [ 163]提供了各种策略的评估，但它采用了一套有限的评估方法，缺乏一个全面、通用的框架。为了解决这些不足，未来的研究应优先考虑开发统一评估多样化策略的标准化基准测试框架。这些基准应涵盖广泛的指标，包括攻击成功率、计算资源需求、响应时间和对自适应防御的弹性。此外，纳入多样化的数据集和现实威胁模型对于确保评估准确反映现实世界场景至关重要。创建由社区贡献支持的开源基准套件将增强透明度和可重复性，使研究人员能够更有效地验证和基于彼此的工作进行构建。通过实施全面的基准测试策略，研究界可以系统地评估现有方法的优缺点，推动更健壮和高效解决方案的创新，并最终提升 LVLMs 在实际应用中的安全性和可靠性。

7.2 Conclusion 7.2 结论

To the best of our knowledge, this is the first survey to offer a comprehensive and systematic review of recent advances in all field of LVLM safety from attacks, defenses, and evaluations, with an analysis of over 100 methods. We present the background of LVLM safety, emphasizing the unique vulnerabilities inherent in these models and introducing fundamental attack classifications. We categorize attack and defense strategies based on the model lifecycle, distinguishing between inference-phase and training-phase methods, and provide detailed sub-classifications with in-depth descriptions of each approach. In the Evaluation section, we synthesize all relevant benchmarks, providing a valuable resource for researchers seeking a comprehensive understanding of the field. Finally, we offer insights into future research directions and highlight open challenges, aiming to encourage further exploration and engagement from the research community in this critical area.
据我们所知，这是首个全面系统地综述 LVLM 安全领域最新进展的调研，涵盖了攻击、防御和评估三个方面，并分析了超过 100 种方法。我们介绍了 LVLM 安全背景，强调了这些模型固有的独特漏洞，并介绍了基础攻击分类。我们根据模型生命周期对攻击和防御策略进行分类，区分了推理阶段和训练阶段的方法，并为每种方法提供了详细的子分类和深入描述。在评估部分，我们综合了所有相关基准，为寻求全面了解该领域的研究人员提供了宝贵资源。最后，我们探讨了未来研究方向，并强调了开放性挑战，旨在鼓励研究界在这一关键领域进行进一步探索和参与。

References

[1] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
[2] S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, and J. Gao, “Large language models: A survey,” arXiv preprint arXiv:2402.06196, 2024.
[3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” NeurIPS, pp. 1877–1901, 2020.
[4] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” JMLR, pp. 1–113, 2023.
[5] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[6] M. Awais, M. Naseer, S. Khan, R. M. Anwer, H. Cholakkal, M. Shah, M.-H. Yang, and F. S. Khan, “Foundation models defining a new era in vision: a survey and outlook,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[7] P. Kaur, G. S. Kashyap, A. Kumar, M. T. Nafis, S. Kumar, and V. Shokeen, “From text to transformation: A comprehensive review of large language models’ versatility,” arXiv preprint arXiv:2402.16142, 2024.
[8] Z. Cai, M. Cao, H. Chen, K. Chen, K. Chen, X. Chen, X. Chen, Z. Chen, Z. Chen, P. Chu et al., “Internlm2 technical report,” arXiv preprint arXiv:2403.17297, 2024.
[9] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024.
[10] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
[11] P. Xu, W. Shao, K. Zhang, P. Gao, S. Liu, M. Lei, F. Meng, S. Huang, Y. Qiao, and P. Luo, “Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models,” IEEE TPAMI, 2024.
[12] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML. PMLR, 2021, pp. 8748–8763.
[13] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” NeurIPS, pp. 23 716–23 736, 2022.
[14] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML. PMLR, 2023, pp. 19 730–19 742.
[15] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, 2024.
[16] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[17] J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[18] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in ICML. PMLR, 2021, pp. 4904–4916.
[19] X. Tian, J. Gu, B. Li, Y. Liu, Y. Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “Drivevlm: The convergence of autonomous driving and large vision-language models,” arXiv preprint arXiv:2402.12289, 2024.
[20] M.-H. Van, P. Verma, and X. Wu, “On large visual language models for medical imaging analysis: An empirical study,” in CHASE. IEEE, 2024, pp. 172–176.
[21] A. Maharana, D. Hannan, and M. Bansal, “Storydall-e: Adapting pretrained text-to-image transformers for story continuation,” in ECCV. Springer, 2022, pp. 70–87.
[22] Y. Zhou, D. Zhou, M.-M. Cheng, J. Feng, and Q. Hou, “Storydiffusion: Consistent self-attention for long-range image and video generation,” arXiv preprint arXiv:2405.01434, 2024.
[23] X. Qi, K. Huang, A. Panda, P. Henderson, M. Wang, and P. Mittal, “Visual adversarial examples jailbreak aligned large language models,” in AAAI, 2024, pp. 21 527–21 536.
[24] C. Schlarmann and M. Hein, “On the adversarial robustness of multi-modal foundation models,” in ICCV, 2023, pp. 3677–3685.
[25] Y. Zhao, T. Pang, C. Du, X. Yang, C. Li, N.-M. M. Cheung, and M. Lin, “On evaluating adversarial robustness of large vision-language models,” NeurIPS, vol. 36, 2024.
[26] R. Wang, X. Ma, H. Zhou, C. Ji, G. Ye, and Y.-G. Jiang, “White-box multimodal jailbreaks against large vision-language models,” in ACM MM, 2024, pp. 6920–6928.
[27] L. Bailey, E. Ong, S. Russell, and S. Emmons, “Image hijacks: Adversarial images can control generative models at runtime,” in ICML, 2024.
[28] Y. Ding, B. Li, and R. Zhang, “Eta: Evaluating then aligning safety of vision language models at inference time,” arXiv preprint arXiv:2410.06625, 2024.
[29] S. Xu, L. Pang, Y. Zhu, H. Shen, and X. Cheng, “Cross-modal safety mechanism transfer in large vision-language models,” arXiv preprint arXiv:2410.12662, 2024.
[30] Y. Xu, J. Yao, M. Shu, Y. Sun, Z. Wu, N. Yu, T. Goldstein, and F. Huang, “Shadowcast: Stealthy data poisoning attacks against vision-language models,” arXiv preprint arXiv:2402.06659, 2024.
[31] J. Liang, S. Liang, M. Luo, A. Liu, D. Han, E.-C. Chang, and X. Cao, “Vl-trojan: Multimodal instruction backdoor attacks against autoregressive visual language models,” arXiv preprint arXiv:2402.13851, 2024.
[32] Z. Ni, R. Ye, Y. Wei, Z. Xiang, Y. Wang, and S. Chen, “Physical backdoor attack can jeopardize driving with vision-large-language models,” arXiv preprint arXiv:2404.12916, 2024.
[33] X. Liu, Y. Zhu, Y. Lan, C. Yang, and Y. Qiao, “Safety of multimodal large language models on images and text,” arXiv preprint arXiv:2402.00357, 2024.
[34] Y. Fan, Y. Cao, Z. Zhao, Z. Liu, and S. Li, “Unbridled icarus: A survey of the potential perils of image inputs in multimodal large language model security,” arXiv preprint arXiv:2404.05264, 2024.
[35] S. Wang, Z. Long, Z. Fan, and Z. Wei, “From llms to mllms: Exploring the landscape of multimodal jailbreaking,” arXiv preprint arXiv:2406.14859, 2024.
[36] H. Jin, L. Hu, X. Li, P. Zhang, C. Chen, J. Zhuang, and H. Wang, “Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models,” arXiv preprint arXiv:2407.01599, 2024.
[37] D. Liu, M. Yang, X. Qu, P. Zhou, Y. Cheng, and W. Hu, “A survey of attacks on large vision-language models: Resources, advances, and future trends,” arXiv preprint arXiv:2407.07403, 2024.
[38] C. Zhang, X. Xu, J. Wu, Z. Liu, and L. Zhou, “Adversarial attacks of vision tasks in the past 10 years: A survey,” arXiv preprint arXiv:2410.23687, 2024.
[39] X. Liu, X. Cui, P. Li, Z. Li, H. Huang, S. Xia, M. Zhang, Y. Zou, and R. He, “Jailbreak attacks and defenses against multimodal generative models: A survey,” arXiv preprint arXiv:2411.09259, 2024.
[40] Y. Li, H. Guo, K. Zhou, W. X. Zhao, and J.-R. Wen, “Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models,” arXiv preprint arXiv:2403.09792, 2024.
[41] Y. Gong, D. Ran, J. Liu, C. Wang, T. Cong, A. Wang, S. Duan, and X. Wang, “Figstep: Jailbreaking large vision-language models via typographic visual prompts,” arXiv preprint arXiv:2311.05608, 2023.
[42] S. Lee, G. Kim, J. Kim, H. Lee, H. Chang, S. H. Park, and M. Seo, “How does vision-language adaptation impact the safety of vision language models?” arXiv preprint arXiv:2410.07571, 2024.
[43] Y. Wang, X. Liu, Y. Li, M. Chen, and C. Xiao, “Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting,” arXiv preprint arXiv:2403.09513, 2024.
[44] Y. Xu, X. Qi, Z. Qin, and W. Wang, “Cross-modality information check for detecting jailbreaking in multimodal large language models,” arXiv preprint arXiv:2407.21659, 2024.
[45] P. Wang, D. Zhang, L. Li, C. Tan, X. Wang, K. Ren, B. Jiang, and X. Qiu, “Inferaligner: Inference-time alignment for harmlessness through cross-model guidance,” arXiv preprint arXiv:2401.11206, 2024.
[46] J. Gao, R. Pi, T. Han, H. Wu, L. Hong, L. Kong, X. Jiang, and Z. Li, “Coca: Regaining safety-awareness of multimodal large language models with constitutional calibration,” arXiv preprint arXiv:2409.11365, 2024.
[47] X. Zhang, C. Zhang, T. Li, Y. Huang, X. Jia, X. Xie, Y. Liu, and C. Shen, “A mutation-based method for multi-modal jailbreaking attack detection,” arXiv preprint arXiv:2312.10766, 2023.
[48] R. Pi, T. Han, J. Zhang, Y. Xie, R. Pan, Q. Lian, H. Dong, J. Zhang, and T. Zhang, “Mllm-protector: Ensuring mllm’s safety without hurting performance,” arXiv preprint arXiv:2401.02906, 2024.
[49] Y. Zong, O. Bohdal, T. Yu, Y. Yang, and T. Hospedales, “Safety fine-tuning at (almost) no cost: A baseline for vision large language models,” arXiv preprint arXiv:2402.02207, 2024.
[50] Y. Zhang, L. Chen, G. Zheng, Y. Gao, R. Zheng, J. Fu, Z. Yin, S. Jin, Y. Qiao, X. Huang et al., “Spa-vl: A comprehensive safety preference alignment dataset for vision language model,” arXiv preprint arXiv:2406.12030, 2024.
[51] Z. Liu, Y. Nie, Y. Tan, X. Yue, Q. Cui, C. Wang, X. Zhu, and B. Zheng, “Safety alignment for vision language models,” arXiv preprint arXiv:2405.13581, 2024.
[52] H. Tu, C. Cui, Z. Wang, Y. Zhou, B. Zhao, J. Han, W. Zhou, H. Yao, and C. Xie, “How many unicorns are in this image? a safety evaluation benchmark for vision llms,” arXiv preprint arXiv:2311.16101, 2023.
[53] X. Liu, Y. Zhu, Y. Lan, C. Yang, and Y. Qiao, “Query-relevant images jailbreak large multi-modal models,” arXiv preprint arXiv:2311.17600, 2023.
[54] W. Luo, S. Ma, X. Liu, X. Guo, and C. Xiao, “Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks,” arXiv preprint arXiv:2404.03027, 2024.
[55] Y. Zhang, Y. Huang, Y. Sun, C. Liu, Z. Zhao, Z. Fang, Y. Wang, H. Chen, X. Yang, X. Wei et al., “Benchmarking trustworthiness of multimodal large language models: A comprehensive study,” arXiv preprint arXiv:2406.07057, 2024.
[56] T. Gu, Z. Zhou, K. Huang, D. Liang, Y. Wang, H. Zhao, Y. Yao, X. Qiao, K. Wang, Y. Yang et al., “Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models,” arXiv preprint arXiv:2406.07594, 2024.
[57] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, p. 9, 2019.
[58] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403, 2023.
[59] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” March 2023. [Online]. Available: https://lmsys.org/blog/2023-03-30-vicuna/
[60] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
[61] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023.
[62] Y. Su, T. Lan, H. Li, J. Xu, Y. Wang, and D. Cai, “Pandagpt: One model to instruction-follow them all,” arXiv preprint arXiv:2305.16355, 2023.
[63] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y. Li, Z. Liu, and C. Li, “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024.
[64] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in CVPR, 2024, pp. 24 185–24 198.
[65] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,” arXiv preprint arXiv:2308.12966, 2023.
[66] J. Lin, H. Yin, W. Ping, P. Molchanov, M. Shoeybi, and S. Han, “Vila: On pre-training for visual language models,” in CVPR, 2024, pp. 26 689–26 699.
[67] R. Pi, J. Zhang, J. Zhang, R. Pan, Z. Chen, and T. Zhang, “Image textualization: An automatic framework for creating accurate and detailed image descriptions,” arXiv preprint arXiv:2406.07502, 2024.
[68] R. Pi, J. Zhang, T. Han, J. Zhang, R. Pan, and T. Zhang, “Personalized visual instruction tuning,” arXiv preprint arXiv:2410.07113, 2024.
[69] I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, and N. K. Ahmed, “Bias and fairness in large language models: A survey,” CL, pp. 1–79, 2024.
[70] X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang, “” do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” in ACM CCS, 2024, pp. 1671–1685.
[71] S. Yi, Y. Liu, Z. Sun, T. Cong, X. He, J. Song, K. Xu, and Q. Li, “Jailbreak attacks and defenses against large language models: A survey,” arXiv preprint arXiv:2407.04295, 2024.
[72] Y. Guo, F. Jiao, L. Nie, and M. Kankanhalli, “The vllm safety paradox: Dual ease in jailbreak attack and defense,” arXiv preprint arXiv:2411.08410, 2024.
[73] G. Pantazopoulos, A. Parekh, M. Nikandrou, and A. Suglia, “Learning to see but forgetting to follow: Visual instruction tuning makes llms more prone to jailbreak attacks,” arXiv preprint arXiv:2405.04403, 2024.
[74] S. Bachu, E. Shayegani, T. Chakraborty, R. Lal, A. Dutta, C. Song, Y. Dong, N. Abu-Ghazaleh, and A. K. Roy-Chowdhury, “Unfair alignment: Examining safety alignment across vision encoder layers in vision-language models,” arXiv preprint arXiv:2411.04291, 2024.
[75] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay, “A survey on adversarial attacks and defences,” CAAI TIT, pp. 25–45, 2021.
[76] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel, “Adversarial attacks on neural network policies,” arXiv preprint arXiv:1702.02284, 2017.
[77] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay, “Adversarial attacks and defences: A survey,” arXiv preprint arXiv:1810.00069, 2018.
[78] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
[79] A. Madry, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
[80] S. Cheng, Y. Dong, T. Pang, H. Su, and J. Zhu, “Improving black-box adversarial attacks with a transfer-based prior,” NeurIPS, vol. 32, 2019.
[81] A. Demontis, M. Melis, M. Pintor, M. Jagielski, B. Biggio, A. Oprea, C. Nita-Rotaru, and F. Roli, “Why do adversarial attacks transfer? explaining transferability of evasion and poisoning attacks,” in USENIX, 2019, pp. 321–338.
[82] Z. Qin, Y. Fan, Y. Liu, L. Shen, Y. Zhang, J. Wang, and B. Wu, “Boosting the transferability of adversarial attacks with reverse adversarial perturbation,” NeurIPS, vol. 35, pp. 29 845–29 858, 2022.
[83] F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques for language models,” arXiv preprint arXiv:2211.09527, 2022.
[84] D. Yao, J. Zhang, I. G. Harris, and M. Carlsson, “Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models,” in ICASSP. IEEE, 2024, pp. 4485–4489.
[85] E. Shayegani, Y. Dong, and N. Abu-Ghazaleh, “Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models,” in ICLR, 2023.
[86] V. Tolpegin, S. Truex, M. E. Gursoy, and L. Liu, “Data poisoning attacks against federated learning systems,” in ESORICs. Springer, 2020, pp. 480–501.
[87] A. Saha, A. Subramanya, and H. Pirsiavash, “Hidden trigger backdoor attacks,” in AAAI, 2020, pp. 11 957–11 965.
[88] Y. Li, T. Zhai, B. Wu, Y. Jiang, Z. Li, and S. Xia, “Rethinking the trigger of backdoor attack,” arXiv preprint arXiv:2004.04692, 2020.
[89] Y. Zeng, W. Park, Z. M. Mao, and R. Jia, “Rethinking the backdoor attacks’ triggers: A frequency perspective,” in ICCV, 2021, pp. 16 473–16 481.
[90] Y. Li, Y. Li, B. Wu, L. Li, R. He, and S. Lyu, “Invisible backdoor attack with sample-specific triggers,” in ICCV, 2021, pp. 16 463–16 472.
[91] E. Bagdasaryan, T.-Y. Hsieh, B. Nassi, and V. Shmatikov, “(ab) using images and sounds for indirect instruction injection in multi-modal llms,” arXiv preprint arXiv:2307.10490, 2023.
[92] K. Gao, Y. Bai, J. Gu, S.-T. Xia, P. Torr, Z. Li, and W. Liu, “Inducing high energy-latency of large vision-language models with verbose images,” arXiv preprint arXiv:2401.11170, 2024.
[93] D. Lu, T. Pang, C. Du, Q. Liu, X. Yang, and M. Lin, “Test-time backdoor attacks on multimodal large language models,” arXiv preprint arXiv:2402.08577, 2024.
[94] Z. Wang, Z. Han, S. Chen, F. Xue, Z. Ding, X. Xiao, V. Tresp, P. Torr, and J. Gu, “Stop reasoning! when multimodal llms with chain-of-thought reasoning meets adversarial images,” arXiv preprint arXiv:2402.14899, 2024.
[95] H. Luo, J. Gu, F. Liu, and P. Torr, “An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models,” arXiv preprint arXiv:2403.09766, 2024.
[96] K. Gao, Y. Bai, J. Bai, Y. Yang, and S.-T. Xia, “Adversarial robustness for visual grounding of multimodal large language models,” arXiv preprint arXiv:2405.09981, 2024.
[97] Z. Ying, A. Liu, T. Zhang, Z. Yu, S. Liang, X. Liu, and D. Tao, “Jailbreak vision language models via bi-modal adversarial prompt,” arXiv preprint arXiv:2406.04031, 2024.
[98] X. Yang, X. Tang, F. Zhu, J. Han, and S. Hu, “Enhancing cross-prompt transferability in vision-language models through contextual injection of target tokens,” arXiv preprint arXiv:2406.13294, 2024.
[99] J. Jang, H. Lyu, J. Koh, and H. J. Yang, “Replace-then-perturb: Targeted adversarial attacks with visual reasoning for vision-language models,” arXiv preprint arXiv:2411.00898, 2024.
[100] Y. Dong, H. Chen, J. Chen, Z. Fang, X. Yang, Y. Zhang, Y. Tian, H. Su, and J. Zhu, “How robust is google’s bard to adversarial image attacks?” arXiv preprint arXiv:2309.11751, 2023.
[101] Z. Niu, H. Ren, X. Gao, G. Hua, and R. Jin, “Jailbreaking attack against multimodal large language model,” arXiv preprint arXiv:2402.02309, 2024.
[102] X. Gu, X. Zheng, T. Pang, C. Du, Q. Liu, Y. Wang, J. Jiang, and M. Lin, “Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast,” arXiv preprint arXiv:2402.08567, 2024.
[103] Z. Tan, C. Zhao, R. Moraffah, Y. Li, Y. Kong, T. Chen, and H. Liu, “The wolf within: Covert injection of malice into mllm societies via an mllm operative,” arXiv preprint arXiv:2402.14859, 2024.
[104] M. Qraitem, N. Tasnim, P. Teterwak, K. Saenko, and B. A. Plummer, “Vision-llms can fool themselves with self-generated typographic attacks,” arXiv preprint arXiv:2402.00626, 2024.
[105] S. Ma, W. Luo, Y. Wang, X. Liu, M. Chen, B. Li, and C. Xiao, “Visual-roleplay: Universal jailbreak attack on multimodal large language models via role-playing image characte,” arXiv preprint arXiv:2405.20773, 2024.
[106] X. Zou, K. Li, and Y. Chen, “Image-to-text logic jailbreak: Your imagination can help you do anything,” arXiv preprint arXiv:2407.02534, 2024.
[107] R. Wang, B. Wang, X. Ma, and Y.-G. Jiang, “Ideator: Jailbreaking vlms using vlms,” arXiv preprint arXiv:2411.00827, 2024.
[108] Y. Wang, X. Zhou, Y. Wang, G. Zhang, and T. He, “Jailbreak large visual language models through multi-modal linkage,” arXiv preprint arXiv:2412.00473, 2024.
[109] M. Teng, J. Xiaojun, D. Ranjie, L. Xinfeng, H. Yihao, C. Zhixuan, L. Yang, and R. Wenqi, “Heuristic-induced multimodal risk distribution jailbreak attack for multimodal large language models,” arXiv preprint arXiv:2412.05934, 2024.
[110] A. Awadalla, I. Gao, J. Gardner, J. Hessel, Y. Hanafy, W. Zhu, K. Marathe, Y. Bitton, S. Gadre, S. Sagawa et al., “Openflamingo: An open-source framework for training large autoregressive vision-language models,” arXiv preprint arXiv:2308.01390, 2023.
[111] I. Shumailov, Y. Zhao, D. Bates, N. Papernot, R. Mullins, and R. Anderson, “Sponge examples: Energy-latency attacks on neural networks,” in EuroS&P. IEEE, 2021, pp. 212–231.
[112] S. Chen, Z. Song, M. Haque, C. Liu, and W. Yang, “Nicgslowdown: Evaluating the efficiency robustness of neural image caption generation models,” in CVPR, 2022, pp. 15 365–15 374.
[113] P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain: Multimodal reasoning via thought chains for science question answering,” NeurIPS, pp. 2507–2521, 2022.
[114] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,” arXiv preprint arXiv:2302.00923, 2023.
[115] L. He, Z. Li, X. Cai, and P. Wang, “Multi-modal latent space learning for chain-of-thought reasoning in language models,” in AAAI, 2024, pp. 18 180–18 187.
[116] Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei, “Kosmos-2: Grounding multimodal large language models to the world,” arXiv preprint arXiv:2306.14824, 2023.
[117] W. Wang, Z. Chen, X. Chen, J. Wu, X. Zhu, G. Zeng, P. Luo, T. Lu, J. Zhou, Y. Qiao et al., “Visionllm: Large language model is also an open-ended decoder for vision-centric tasks,” NeurIPS, 2024.
[118] M. Li and L. Sigal, “Referring transformer: A one-step approach to multi-task visual grounding,” NeurIPS, pp. 19 652–19 664, 2021.
[119] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” NeurIPS, pp. 24 824–24 837, 2022.
[120] Z. Chu, J. Chen, Q. Chen, W. Yu, T. He, H. Wang, W. Peng, M. Liu, B. Qin, and T. Liu, “A survey of chain of thought reasoning: Advances, frontiers and future,” arXiv preprint arXiv:2309.15402, 2023.
[121] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML. PMLR, 2022, pp. 12 888–12 900.
[122] Y. Wang, C. Liu, Y. Qu, H. Cao, D. Jiang, and L. Xu, “Break the visual perception: Adversarial attacks targeting encoded visual tokens of large vision-language models,” in ACM MM, 2024, pp. 1072–1081.
[123] S. Bird, E. Klein, and E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit. ” O’Reilly Media, Inc.”, 2009.
[124] J. Xu, M. D. Ma, F. Wang, C. Xiao, and M. Chen, “Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models,” arXiv preprint arXiv:2305.14710, 2023.
[125] J. Yan, V. Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V. Srinivasan, X. Ren, and H. Jin, “Backdooring instruction-tuned large language models with virtual prompt injection,” in NAACL, 2024, pp. 6065–6086.
[126] X. Tao, S. Zhong, L. Li, Q. Liu, and L. Kong, “Imgtrojan: Jailbreaking vision-language models with one image,” arXiv preprint arXiv:2403.02910, 2024.
[127] S. Liang, J. Liang, T. Pang, C. Du, A. Liu, E.-C. Chang, and X. Cao, “Revisiting backdoor attacks against large vision-language models,” arXiv preprint arXiv:2406.18844, 2024.
[128] W. Lyu, L. Pang, T. Ma, H. Ling, and C. Chen, “Trojvlm: Backdoor attack against vision language models,” arXiv preprint arXiv:2409.19232, 2024.
[129] W. Lyu, J. Yao, S. Gupta, L. Pang, T. Sun, L. Yi, L. Hu, H. Ling, and C. Chen, “Backdooring vision-language models with out-of-distribution data,” arXiv preprint arXiv:2410.01264, 2024.
[130] J. Sun, C. Wang, J. Wang, Y. Zhang, and C. Xiao, “Safeguarding vision-language models against patched visual prompt injectors,” arXiv preprint arXiv:2405.10529, 2024.
[131] Y. Zhao, X. Zheng, L. Luo, Y. Li, X. Ma, and Y.-G. Jiang, “Bluesuffix: Reinforced blue teaming for vision-language models against jailbreak attacks,” arXiv preprint arXiv:2410.20971, 2024.
[132] S. Oh, Y. Jin, M. Sharma, D. Kim, E. Ma, G. Verma, and S. Kumar, “Uniguard: Towards universal safety guardrails for jailbreak attacks on multimodal large language models,” arXiv preprint arXiv:2411.01703, 2024.
[133] Q. Liu, C. Shang, L. Liu, N. Pappas, J. Ma, N. A. John, S. Doss, L. Marquez, M. Ballesteros, and Y. Benajiba, “Unraveling and mitigating safety alignment degradation of vision-language models,” arXiv preprint arXiv:2410.09047, 2024.
[134] H. Wang, G. Wang, and H. Zhang, “Steering away from harm: An adaptive approach to defending vision language model against jailbreaks,” arXiv preprint arXiv:2411.16721, 2024.
[135] S. S. Ghosal, S. Chakraborty, V. Singh, T. Guan, M. Wang, A. Beirami, F. Huang, A. Velasquez, D. Manocha, and A. S. Bedi, “Immune: Improving safety against jailbreaks in multi-modal llms via inference-time alignment,” arXiv preprint arXiv:2411.18688, 2024.
[136] Y. Gou, K. Chen, Z. Liu, L. Hong, H. Xu, Z. Li, D.-Y. Yeung, J. T. Kwok, and Y. Zhang, “Eyes closed, safety on: Protecting multimodal llms via image-to-text transformation,” in ECCV. Springer, 2024, pp. 388–404.
[137] S. Fares, K. Ziu, T. Aremu, N. Durasov, M. Takáč, P. Fua, K. Nandakumar, and I. Laptev, “Mirrorcheck: Efficient adversarial defense for vision-language models,” arXiv preprint arXiv:2406.09250, 2024.
[138] A. Robey, E. Wong, H. Hassani, and G. J. Pappas, “Smoothllm: Defending large language models against jailbreaking attacks,” arXiv preprint arXiv:2310.03684, 2023.
[139] Y. Xie, M. Fang, R. Pi, and N. Gong, “Gradsafe: Detecting jailbreak prompts for llms via safety-critical gradient analysis,” in ACL, 2024, pp. 507–518.
[140] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022, pp. 10 684–10 695.
[141] F. Bao, S. Nie, K. Xue, C. Li, S. Pu, Y. Wang, G. Yue, Y. Cao, H. Su, and J. Zhu, “One transformer fits all distributions in multi-modal diffusion at scale,” in ICML. PMLR, 2023, pp. 1692–1717.
[142] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in ICCV, 2023, pp. 3836–3847.
[143] Y. Huang, F. Zhu, J. Tang, P. Zhou, W. Lei, J. Lv, and T.-S. Chua, “Effective and efficient adversarial detection for vision-language models via a single vector,” arXiv preprint arXiv:2410.22888, 2024.
[144] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV. Springer, 2014, pp. 740–755.
[145] Y. Chen, K. Sikka, M. Cogswell, H. Ji, and A. Divakaran, “Dress: Instructing large vision-language models to align and interact with humans via natural language feedback,” in CVPR, 2024, pp. 14 239–14 250.
[146] L. Helff, F. Friedrich, M. Brack, K. Kersting, and P. Schramowski, “Llavaguard: Vlm-based safeguards for vision dataset curation and safety assessment,” arXiv preprint arXiv:2406.05113, 2024.
[147] D. L. Crone, S. Bode, C. Murawski, and S. M. Laham, “The socio-moral image database (smid): A novel stimulus set for the study of social, moral and affective processes,” PloS one, p. e0190954, 2018.
[148] C. Schlarmann, N. D. Singh, F. Croce, and M. Hein, “Robust clip: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models,” arXiv preprint arXiv:2402.12336, 2024.
[149] J. Li, Q. Wei, C. Zhang, G. Qi, M. Du, Y. Chen, and S. Bi, “Single image unlearning: Efficient machine unlearning in multimodal large language models,” arXiv preprint arXiv:2405.12523, 2024.
[150] T. Chakraborty, E. Shayegani, Z. Cai, N. Abu-Ghazaleh, M. S. Asif, Y. Dong, A. K. Roy-Chowdhury, and C. Song, “Cross-modal safety alignment: Is textual unlearning all you need?” arXiv preprint arXiv:2406.02575, 2024.
[151] M. Z. Hossain and A. Imteaj, “Sim-clip: Unsupervised siamese adversarial fine-tuning for robust and semantically-rich vision-language models,” arXiv preprint arXiv:2407.14971, 2024.
[152] ——, “Securing vision-language models with a robust encoder against jailbreak and adversarial attacks,” arXiv preprint arXiv:2409.07353, 2024.
[153] Y. Chen, H. Li, Z. Zheng, and Y. Song, “Bathe: Defense against the jailbreak attack in multimodal large language models by treating harmful instruction as backdoor trigger,” arXiv preprint arXiv:2408.09093, 2024.
[154] Z. Ying, A. Liu, S. Liang, L. Huang, J. Guo, W. Zhou, X. Liu, and D. Tao, “Safebench: A safety evaluation framework for multimodal large language models,” arXiv preprint arXiv:2410.18927, 2024.
[155] H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine et al., “Llama guard: Llm-based input-output safeguard for human-ai conversations,” arXiv preprint arXiv:2312.06674, 2023.
[156] H. Zhang, W. Shao, H. Liu, Y. Ma, P. Luo, Y. Qiao, and K. Zhang, “Avibench: Towards evaluating the robustness of large vision-language model on adversarial visual-instructions,” arXiv preprint arXiv:2403.09346, 2024.
[157] H. Cheng, E. Xiao, and R. Xu, “Typographic attacks in large multimodal models can be alleviated by more informative prompts,” arXiv preprint arXiv:2402.19150, 2024.
[158] M. Li, L. Li, Y. Yin, M. Ahmed, Z. Liu, and Q. Liu, “Red teaming visual language models,” arXiv preprint arXiv:2401.12915, 2024.
[159] S. Chen, Z. Han, B. He, Z. Ding, W. Yu, P. Torr, V. Tresp, and J. Gu, “Red teaming gpt-4v: Are gpt-4v safe against uni/multi-modal jailbreak attacks?” arXiv preprint arXiv:2404.03411, 2024.
[160] X. Li, H. Zhou, R. Wang, T. Zhou, M. Cheng, and C.-J. Hsieh, “Mossbench: Is your multimodal language model oversensitive to safe queries?” arXiv preprint arXiv:2406.17806, 2024.
[161] Y. Liu, C. Cai, X. Zhang, X. Yuan, and C. Wang, “Arondight: Red teaming large vision language models with auto-generated multi-modal jailbreak prompts,” in ACM MM, 2024, pp. 3578–3586.
[162] K. Zhou, C. Liu, X. Zhao, A. Compalas, D. Song, and X. E. Wang, “Multimodal situational safety,” arXiv preprint arXiv:2410.06172, 2024.
[163] F. Weng, Y. Xu, C. Fu, and W. Wang, “Mmj-bench: A comprehensive study on jailbreak attacks and defenses for multimodal large language models,” arXiv preprint arXiv:2408.08464, 2024.
[164] M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li et al., “Harmbench: A standardized evaluation framework for automated red teaming and robust refusal,” arXiv preprint arXiv:2402.04249, 2024.
[165] J. Ji, M. Liu, J. Dai, X. Pan, C. Zhang, C. Bian, B. Chen, R. Sun, Y. Wang, and Y. Yang, “Beavertails: Towards improved safety alignment of llm via a human-preference dataset,” NeurIPS, vol. 36, 2024.
[166] Y. Liu, Y. Yao, J.-F. Ton, X. Zhang, R. G. H. Cheng, Y. Klochkov, M. F. Taufiq, and H. Li, “Trustworthy llms: A survey and guideline for evaluating large language models’ alignment,” arXiv preprint arXiv:2308.05374, 2023.
[167] U. Anwar, A. Saparov, J. Rando, D. Paleka, M. Turpin, P. Hase, E. S. Lubana, E. Jenner, S. Casper, O. Sourbut et al., “Foundational challenges in assuring alignment and safety of large language models,” arXiv preprint arXiv:2404.09932, 2024.
[168] S. Wang, X. Ye, Q. Cheng, J. Duan, S. Li, J. Fu, X. Qiu, and X. Huang, “Cross-modality safety alignment,” arXiv preprint arXiv:2406.15279, 2024.
[169] X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan, “Janus-pro: Unified multimodal understanding and generation with data and model scaling,” arXiv preprint arXiv:2501.17811, 2025.
[170] J. Chen, D. Zhu, X. Shen, X. Li, Z. Liu, P. Zhang, R. Krishnamoorthi, V. Chandra, Y. Xiong, and M. Elhoseiny, “Minigpt-v2: large language model as a unified interface for vision-language multi-task learning,” arXiv preprint arXiv:2310.09478, 2023.
[171] T. Kaufmann, P. Weng, V. Bengs, and E. Hüllermeier, “A survey of reinforcement learning from human feedback,” arXiv preprint arXiv:2312.14925, 2023.

A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations《大型视觉语言模型安全综述：攻击、防御与评估》