H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs
Abstract
Large language models (LLMs) frequently generate hallucinations – plausible but factually incorrect outputs – undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than 1‰ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that they remain predictive for hallucination detection, indicating that they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
1 Introduction
In recent years, large language models (LLMs) have achieved groundbreaking advancements in natural language processing tasks, demonstrating impressive potential towards artificial general intelligence (foundation-model; gpt3; instruct-gpt; gpt4). However, these advancements come with a persistent reliability challenge that troubles researchers and users alike: hallucinations. Hallucinations occur when models produce outputs that seem plausible but are factually inaccurate or unsupported by evidence (OnFaithfulnessandFactuality; hallusurvey). For example, GPT-3.5 has been shown to hallucinate in approximately 40% of citation-based factuality evaluations, a figure that improves but remains high at 28.6% for GPT-4 (Hallucination_Rates). Similarly, emerging reasoning-centric systems such as DeepSeek-R1, despite demonstrating strong performance on complex tasks, continue to exhibit pronounced hallucination modes (vectara_deepseek_r1_hallucination). Collectively, these observations indicate that hallucinations persist regardless of model architecture, highlighting a critical bottleneck in the reliability of state-of-the-art LLMs.
To improve LLM reliability, researchers have invested considerable effort in uncovering the mechanisms and factors behind hallucinations, which can be broadly grouped into three categories. First, from a training data perspective, distribution imbalances and inherent biases within datasets make it difficult for models to accurately recall long-tail facts (DBLP:conf/naacl/SunXZLD24; DBLP:conf/acl/LiLSDSLJJL22).
Second, training objectives in both pretraining and post-training phases primarily incentivize confident predictions without promoting the expression of uncertainty for unfamiliar information, encouraging models to output incorrect guesses (DBLP:journals/corr/abs-2509-04664). Specifically, the next-token prediction goal in pretraining prioritizes fluent continuations over factual accuracy, while instruction tuning or reinforcement learning often favors generating superficially helpful responses, sometimes at the expense of honest refusals to answer.
Third, decoding algorithms introduce instability through randomness and error accumulation in autoregressive generation, allowing small deviations to snowball into hallucinations (DBLP:conf/icml/ZhangPMLS24; DBLP:conf/nips/LeePXPFSC22; DBLP:conf/nips/KapoorGRCPBWDGW24).
Current studies largely treat LLMs as black boxes, examining hallucination causes at a macroscopic level while neglecting microscopic insights into neuron-level mechanisms. Yet, such fine-grained analysis holds immense promise for explaining how hallucinations arise and for developing mitigation strategies.
Just as biological research on cellular division informs treatments for diseases such as cancer (collins1997cell; matthews2022cell), and neuroscience investigations into individual neuronal activity and synaptic interactions shape theories of cognition like learning (DBLP:journals/natmi/LuczakMK22) and memory (mongillo2008synaptic; lisman2018memory), analyzing neurons – the fundamental computational units of LLMs – is essential for decoding hallucination. By scrutinizing neurons’ activation patterns in relation to hallucinations, we can gain deeper insights into model reliability. In terms of interpretability, neuron-level analysis can enable the prediction of when hallucinations are prone to emerge; for alignment and behavioral control, it provides actionable intervention points, such as activating or suppressing specific subsets of neurons to reliably modify model outputs.
In this paper, we adopt a neuron-centric perspective to investigate the microscopic mechanisms of hallucinations in LLMs. Prior research has shown that internal hidden states can serve as effective features for detecting hallucinations (Internal_States), while work using sparse autoencoders has provided case studies connecting hallucinations to specific neuron activations (anthropicSAE; SAEhallu), hinting at a deeper link between neuronal behavior and hallucination generation.
Building on this foundation, we identify a set of hallucination-associated neurons and term them as H-Neurons. We then systematically explore the existence, behavioral impacts, and origins of H-Neurons. We address the following three research questions:
• Q1: Do H-Neurons exist? Can we identify specific neurons whose activations reliably distinguish between hallucinatory and faithful outputs?

• Q2: How do these neurons influence model behavior? Specifically, what types of tasks exhibit a significant change in performance when these neurons’ activations are altered, thereby establishing a link between hallucination and other capabilities?

• Q3: When do these neurons originate? Are they introduced during the post-training alignment phase, or are they already present in the pre-trained base model?
Specifically, drawing from setups in previous work (Finding_Safety_Neurons; Finding_Skill_Neurons; Detecting_hallu), we focus on neurons in the feedforward networks, examine hallucinations in knowledge-based question answering, and make the following observations.
Existence of H-Neurons Our investigation reveals that a remarkably sparse subset of neurons – comprising less than 1‰ of the model’s total neurons – can accurately predict whether the model will produce hallucinated responses. We refer to these predictive neurons as H-Neurons.
To identify these neurons, we develop a systematic methodology that contrasts activation patterns between faithful and hallucinated responses, then apply sparse logistic regression to uncover the most predictive neurons. Notably, the neurons identified through simple QA tasks demonstrate strong generalization capability: they maintain robust predictive accuracy across out-of-distribution scenarios, ranging from specialized cross-domain contexts to pure fabrications concerning non-existent entities, achieving reliable hallucination detection.
Impact on Model Behavior Our analysis uncovers that H-Neurons are linked to over-compliance behaviors in LLMs. To establish this causal relationship, we conduct controlled interventions by systematically scaling the activation magnitudes of these neurons. The interventions reveal a distinctive behavioral pattern: amplifying H-Neurons’ activations systematically increases a spectrum of over-compliance behaviors – ranging from overcommitment to incorrect premises and heightened susceptibility to misleading contexts, to increased adherence to harmful instructions and stronger sycophantic tendencies. These findings suggest that H-Neurons do not simply encode factual errors, but rather represent a general tendency to prioritize conversational compliance over factual integrity.
Origin of H-Neurons Our investigation reveals that H-Neurons originate during the pre-training phase, providing empirical evidence for the insights proposed by OpenAI researchers from the perspective of learning theory (DBLP:journals/corr/abs-2509-04664). To trace their developmental timeline, we conduct cross-model transfer experiments: we apply the hallucination neurons identified in instruction-tuned models to their corresponding base models and evaluate their predictive efficacy. The results demonstrate that these neurons retain their predictive ability in base models, successfully detecting hallucinations even prior to fine-tuning.
In summary, this paper provides a systematic neuron-level investigation into the microscopic mechanisms of hallucinations in LLMs. By bridging the gap between macroscopic behavioral patterns and fine-grained neural activations, we hope our work can deepen the understanding of how hallucinations arise at the computational level, and offer actionable insights for developing more reliable LLMs.
2 Identification of H-Neurons
While prior work has demonstrated that internal hidden states can detect hallucinations (Internal_States; anthropicSAE; SAEhallu), a systematic investigation into hallucination-associated neurons remains absent. In this section, we address our first research question: Do H-Neurons exist? We hypothesize that among the millions of neurons in modern LLMs, a sparse subset exhibits activation patterns that systematically distinguish between hallucinatory and faithful outputs. The sparse subset of neurons could serve as both interpretable indicators for detection and precise intervention points for further research.
Figure 1: Framework for identifying H-Neurons. (a) Within the feedforward network, we compute each neuron’s contribution during a single forward pass. The metric normalizes the magnitude of an individual neuron’s projected output against the layer’s total output vector, providing a standardized measure of its contribution to the hidden state. (b) The process begins by generating a balanced dataset of faithful (green check) and hallucinated (red cross) responses from the TriviaQA benchmark. We extract neuron contribution features specifically on the answer tokens to train a linear classifier. Neurons assigned positive weights by this classifier are identified as "H-Neurons", distinguishing them from ordinary neurons by their predictive role in hallucination generation.
To identify H-Neurons from the vast parameter space of LLMs, we employ a sparse linear probing approach (Figure 1). We first quantify each neuron’s contribution to the responses using the CETT metric (relu2wins), which is used to measure the neuron’s activation level during generation.
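As an illustrative sketch of this kind of contribution score (the exact CETT formulation in relu2wins may differ; the shapes and names here are assumptions for a single token at one FFN layer):

```python
import numpy as np

def neuron_contributions(activations, w_down):
    """Per-neuron contribution scores for one FFN layer at one token.

    activations: (d_ff,) post-nonlinearity neuron activations.
    w_down:      (d_ff, d_model) rows of the down-projection matrix.

    Neuron i contributes activations[i] * w_down[i] to the layer output;
    the score normalizes the magnitude of that projected output by the
    magnitude of the layer's total output vector.
    """
    per_neuron = activations[:, None] * w_down      # (d_ff, d_model)
    layer_out = per_neuron.sum(axis=0)              # (d_model,)
    norms = np.linalg.norm(per_neuron, axis=1)      # (d_ff,)
    return norms / (np.linalg.norm(layer_out) + 1e-8)

rng = np.random.default_rng(0)
scores = neuron_contributions(rng.standard_normal(16),
                              rng.standard_normal((16, 8)))
print(scores.shape)  # → (16,)
```

Averaging such scores over the answer tokens of a response yields one feature per neuron, which is what the classifier below consumes.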
We then frame hallucination detection as a binary classification problem: predicting whether a response is hallucinatory based on neuron activations. Using logistic regression with L1 regularization, we train a sparse classifier that automatically selects the most predictive neurons by driving most weights to zero.
The neurons with non-zero weights are identified as H-Neurons. Training data is collected from TriviaQA (Triviaqa) by sampling multiple responses per question and labeling them based on factual correctness. To demonstrate the effectiveness of H-Neurons, we establish a baseline by training linear classifiers on randomly selected neurons. To ensure a fair comparison, the number of randomly selected neurons matches that of the H-Neurons.
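A minimal sketch of the sparse probing step, with synthetic stand-in features (the data construction is simplified to random arrays; the regularization strength `C` and the planted signal are illustrative, not the paper's settings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real data: rows are sampled responses, columns are
# per-neuron contribution features, and labels mark hallucinated (1) vs
# faithful (0) responses. In the paper these come from TriviaQA samples.
rng = np.random.default_rng(0)
n_responses, n_neurons = 400, 1000
X = rng.standard_normal((n_responses, n_neurons))
y = rng.integers(0, 2, n_responses)
X[y == 1, :5] += 1.0  # plant a weak signal in five "hallucination" neurons

# The L1 penalty drives most coefficients to exactly zero, so the
# surviving non-zero weights pick out the candidate H-Neurons.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
probe.fit(X, y)

h_neurons = np.flatnonzero(probe.coef_[0])
print(f"selected {h_neurons.size} of {n_neurons} neurons")
```

The same fitted probe then serves directly as the hallucination detector evaluated in Table 1.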
Table 1: Hallucination detection accuracy (%) of neuron-based classifiers. We evaluate the performance of neuron-based classifiers on six widely used LLMs. Here, "Random" and "Hallucination" denote classifiers trained on randomly selected neurons and on H-Neurons, respectively. Ratio is the proportion of the model’s total neurons used by the classifier. Classifiers built on H-Neurons effectively detect hallucinations on in-domain questions (TriviaQA and NQ), cross-domain questions (BioASQ), and fabricated questions, demonstrating the robustness of H-Neurons. H-Neurons typically account for less than 1‰ of all neurons in the LLMs.
| Models | Neurons | Ratio (‰) | TriviaQA | NQ-Open | BioASQ | NonExist |
|---|---|---|---|---|---|---|
| Mistral-7B-v0.3 | Random | 0.35 | 61.7 | 56.1 | 59.4 | 80.9 |
| | Hallucination | 0.35 | 78.4 | 71.5 | 75.5 | 91.1 |
| Mistral-Small-3.1-24B | Random | 0.01 | 61.1 | 56.8 | 52.8 | 57.4 |
| | Hallucination | 0.01 | 81.0 | 71.3 | 69.5 | 86.6 |
| Gemma-3-4B | Random | 0.10 | 62.0 | 59.7 | 56.0 | 56.9 |
| | Hallucination | 0.10 | 76.9 | 70.7 | 71.0 | 71.9 |
| Gemma-3-27B | Random | 0.18 | 65.2 | 58.5 | 61.8 | 58.2 |
| | Hallucination | 0.18 | 83.6 | 68.6 | 72.0 | 95.9 |
| Llama-3.1-8B | Random | 0.02 | 56.1 | 53.0 | 52.9 | 50.6 |
| | Hallucination | 0.02 | 70.1 | 63.3 | 66.0 | 43.1 |
| Llama-3.3-70B | Random | 0.01 | 68.4 | 58.9 | 66.9 | 69.6 |
| | Hallucination | 0.01 | 82.7 | 67.2 | 74.3 | 96.7 |
To assess whether the identified neurons generalize beyond the training set and reflect broader patterns of hallucination, we evaluate the trained linear model for hallucination detection on diverse question collections.
We design a comprehensive evaluation protocol covering three distinct hallucination scenarios:
(1) In-Domain Knowledge Recall: We evaluate on TriviaQA and NQ (nq), both constructed from Wikipedia, a corpus extensively used in LLM pretraining. These datasets test whether hallucination neurons can detect failures in recalling familiar but unmemorized knowledge.
(2) Cross-Domain Robustness: We evaluate on BioASQ (bioasq), a biomedical question-answering dataset. Since our classifier is trained exclusively on TriviaQA with general knowledge, BioASQ tests cross-domain generalization to specialized domains with distinct terminology and factual structures.
(3) Fabricated Knowledge Detection: We construct a dataset, referred to as NonExist, containing artificially generated questions about non-existent entities (e.g., "Who manufactures the medicine volor pri octacap?" where "volor pri octacap" is fabricated) (hallulens). When models provide confident answers to such questions, it constitutes a clear hallucination. This scenario tests whether hallucination neurons can detect fabrication, generating plausible-sounding answers about facts absent from any training data.
Together, these settings provide comprehensive coverage: from recall failures on seen knowledge, to domain transfer, to complete fabrication, enabling assessment of the generality and robustness of H-Neurons.
Table 1 presents the hallucination detection performance of neuron-based classifiers across six widely-used LLMs. The results demonstrate that H-Neurons exhibit remarkable robustness in detecting hallucinations. First, classifiers built on H-Neurons consistently and substantially outperform those using randomly selected neurons across all models and evaluation settings, with accuracy improvements often exceeding ten percentage points. Second, these classifiers remain robust across diverse scenarios: they achieve high accuracy on in-domain datasets (TriviaQA and NQ), exhibit strong generalization on cross-domain biomedical questions (BioASQ), and retain effectiveness on fabricated questions (NonExist). The consistent performance across familiar knowledge recall, domain transfer, and complete fabrication scenarios indicates that H-Neurons capture generalizable patterns of hallucinations rather than dataset-specific artifacts.
Remarkably, H-Neurons constitute an extremely sparse subset of the model’s total neurons, typically accounting for less than 1‰ of all neurons – ranging from 0.01‰ in large models such as Mistral-Small-3.1-24B and Llama-3.3-70B to 0.35‰ in Mistral-7B-v0.3. Despite their scarcity, this small set of neurons provides sufficient signal to reliably detect hallucinations, demonstrating that a compact subset of model parameters carries substantial information about hallucination tendencies.
3 Behavioral Impact of H-Neurons
Figure 2: Illustration of the behavioral impact of intervening on H-Neurons. The right panel shows examples along four dimensions: invalid premises (hallucinating details about nonexistent "cat wool"), compliance with misleading contexts (adopting counterfactual statements about Marie Curie), skeptical attitudes (abandoning a correct answer when challenged), and harmful instructions (bypassing safety filters to assist with weapon manufacturing).
Having established the existence of H-Neurons and their predictive ability in Section 2, a natural question arises: What functional role do these neurons play in shaping model behavior? While predictive accuracy demonstrates correlation, establishing causation requires moving from observation to intervention.
In this section, we conduct controlled perturbation experiments to determine whether artificially modulating these neurons leads to systematic and interpretable changes in model outputs, and whether such changes reveal a broader behavioral pattern that extends beyond factual errors.
To probe the causal impact of H-Neurons, we design a systematic perturbation methodology that modulates their contributions during inference without retraining the model. Following the identification procedure, we focus on neurons with positive weights in the hallucination detection classifier, as their activation exhibits a positive correlation with hallucinatory responses.
Our intervention operates by scaling the activation values of these neurons during forward passes: for each target neuron, we multiply its activation by a scaling factor α, where α < 1 suppresses the neuron’s influence by reducing its activation strength, α = 1 preserves the original behavior, and α > 1 amplifies its contribution to responses by increasing activation magnitude.
This approach enables a direct assessment of whether modulating the influence of H-Neurons induces systematic behavioral changes, and whether such changes align with the semantic or safety risks associated with hallucination.
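A minimal sketch of such a scaling intervention using a PyTorch forward hook (the toy FFN, the neuron indices, and the α value are hypothetical; the paper's actual implementation is not shown):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_scaling_hook(neuron_ids, alpha):
    """Scale the selected neurons' activations by alpha in the forward pass.

    alpha < 1 suppresses the neurons, alpha = 1 preserves the original
    behavior, and alpha > 1 amplifies their contribution to the output.
    """
    def hook(module, inputs, output):
        output = output.clone()
        output[..., neuron_ids] *= alpha
        return output
    return hook

# Toy FFN standing in for one transformer MLP block.
ffn = nn.Sequential(nn.Linear(8, 32), nn.GELU(), nn.Linear(32, 8))
h_neurons = [3, 7, 19]  # hypothetical H-Neuron indices in this layer

# Hook the activations just before the down-projection.
handle = ffn[1].register_forward_hook(make_scaling_hook(h_neurons, alpha=2.0))
x = torch.randn(1, 8)
amplified = ffn(x)
handle.remove()
baseline = ffn(x)
print(torch.allclose(amplified, baseline))  # → False: the intervention changes the output
```

Because the hook only rescales existing activations, the model requires no retraining, and removing the hook restores the original behavior exactly.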
Figure 3: Compliance rates (%) of perturbed LLMs. Performance changes when H-Neurons are suppressed (scaling factor α < 1) or amplified (α > 1) on the following compliance benchmarks: (a) invalid premises, (b) misleading guidance, (c) skeptical attitudes, and (d) harmful instructions. Specifically, the compliance rate is measured as the acceptance rate of invalid premises on FalseQA, accuracy on FaithEval, the rate of agreement with incorrect feedback on Sycophancy, and the rate of harmful responses on Jailbreak. Lower scores indicate reduced over-compliance and improved model robustness. As the scaling factor increases, compliance rates generally rise across all four dimensions, indicating that H-Neurons causally control over-compliance behaviors.
A prevailing hypothesis in the literature attributes hallucinations to models’ tendency to venture risky guesses in pursuit of higher accuracy (DBLP:conf/stoc/KalaiV24; DBLP:conf/nips/CohenKRF24; DBLP:journals/corr/abs-2509-04664). We propose a complementary perspective: this risk-taking behavior is one manifestation of a more fundamental phenomenon, over-compliance, defined as the model’s tendency to satisfy user requests even when doing so compromises truthfulness, safety, or integrity. For example, when a model generates hallucinated content to answer an unanswerable question, it prioritizes the implicit human expectation of receiving an answer over the admission of uncertainty or knowledge boundaries, analogous to how humans may lie due to social desirability (wholies; lalwani2006relation).
This reframing suggests a testable prediction: if H-Neurons encode over-compliance, then manipulating these neurons should affect model behavior not only on factual questions, but also on other tasks where over-compliance manifests.
To test this hypothesis systematically, we evaluate the modified model across four carefully selected benchmarks, each probing a different facet of over-compliance (Figure 2):
(1) FalseQA (FalseQA) assesses compliance with invalid premises, probing whether models attempt to answer questions built on factually incorrect assumptions rather than rejecting the flawed premise.
(2) FaithEval (Faitheval) examines compliance with misleading contexts, evaluating whether models uncritically accept and follow potentially incorrect information provided in prompts rather than questioning or verifying it.
(3) Sycophancy (Sycophancy) measures compliance with skeptical attitudes, quantifying the tendency to echo user opinions or revise correct answers when users express disagreement rather than maintaining epistemic integrity.
(4) Jailbreak (Jailbreak) tests compliance with harmful instructions, measuring whether models inappropriately satisfy instructions that violate safety guidelines.
Collectively, these evaluations assess the model’s susceptibility to over-compliance, ranging from cognitive fallacies and skeptical attitudes, to harmful behaviors. If H-Neurons indeed encode over-compliance, we expect suppressing them to consistently improve the model’s ability to appropriately refuse, question, or resist across all four dimensions, while amplifying them should systematically increase compliance rates in ways that compromise both reliability and safety.
Figure 3 illustrates the relationship between the scaling factor applied to H-Neurons and the model’s compliance rate. Overall, we observe that: (1) There is a consistent positive correlation between the neurons’ scaling factor and the model’s compliance rate, observed across all four evaluation dimensions. This indicates that artificially amplifying the activations of these H-Neurons significantly compromises the model’s resistance to false premises, misleading contexts, skeptical attitudes, and harmful instructions, whereas suppressing them reduces over-compliance behaviors, effectively restoring the model’s robustness and integrity.
(2) The susceptibility of models to neuron perturbation generally exhibits an inverse correlation with parameter size. The three smaller models exhibit a steeper average growth in compliance rates across the evaluated dimensions, whereas the three larger models maintain a more moderate average growth. This suggests that smaller models are more prone to drastic behavioral shifts under internal perturbation, while larger models likely possess greater intrinsic robustness that mitigates the impact of amplifying specific neuron groups.
(3) The behavioral response is not strictly monotonic in all cases. In tasks such as FalseQA and Jailbreak, certain models exhibit fluctuations or temporary drops in compliance at intermediate scaling factors. This is likely due to complex internal mechanisms: since we amplify the neurons linearly, this strong intervention might push the model’s internal features out of distribution at certain points, unexpectedly decreasing compliance. A notable instance appears in the Sycophancy task, where the smallest model, Gemma-3-4B, initially exhibits increased compliance that subsequently declines as the scaling factor increases.
4 Origin of H-Neurons
Having established the existence and explored the behavioral impact of H-Neurons, we now investigate their origins: Do these neurons emerge during pre-training, or are they artifacts of post-training alignment? Determining this timeline is crucial, as it dictates whether mitigation efforts should focus on the pre-training process or alignment algorithms.
If H-Neurons already show distinct activation patterns in the base model, this would suggest that hallucination behavior has roots in pre-training representations rather than purely SFT-induced alignment dynamics.
To answer this, we conduct two complementary analyses. First, we examine the backward transferability of H-Neurons. We hypothesize that if these neurons originate during pre-training, the detection probes trained on instruction-tuned models should remain effective on their corresponding base models. We apply the classifiers trained on instruction-tuned models (Section 2) directly to the base models. This allows us to evaluate whether the same neuron subset preserves its predictive ability across models. However, since activation magnitudes often shift significantly from pre-training to fine-tuning, using the same fixed classification threshold as in Section 2 is unreliable. Instead, we utilize the Area Under the Receiver Operating Characteristic Curve (AUROC) as our primary metric.
Second, we study how instruction tuning changes these neurons to determine whether the alignment process actively constructs or merely preserves the circuits responsible for hallucination. To quantify the modifications induced by SFT, we compute the cosine distance between the base and aligned models for each neuron’s up-projection and down-projection weights, and analyze the rank distribution of H-Neurons within the global parameter space. This comparative ranking allows us to determine whether the alignment process modifies these specific neurons more significantly than the average neuron in the network.
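The second analysis can be sketched as follows, with made-up weight matrices standing in for the real FFN projections (function and variable names are illustrative, and the simulated "SFT" update is a toy assumption):

```python
import numpy as np

def neuron_similarity_ranks(w_base, w_tuned, h_ids):
    """Normalized similarity ranks of selected neurons.

    w_base, w_tuned: (n_neurons, d) per-neuron weight rows (e.g. the
    concatenated up- and down-projection weights) before and after SFT.
    Returns ranks in [0, 1]; values near 1 mean a neuron changed *less*
    than most neurons between the two checkpoints.
    """
    cos = (w_base * w_tuned).sum(axis=1) / (
        np.linalg.norm(w_base, axis=1) * np.linalg.norm(w_tuned, axis=1))
    order = cos.argsort().argsort()  # rank 0 = most changed (lowest cosine)
    return order[h_ids] / (len(cos) - 1)

rng = np.random.default_rng(0)
w_base = rng.standard_normal((1000, 64))
w_tuned = w_base + 0.3 * rng.standard_normal((1000, 64))  # simulated SFT update
h_ids = np.array([10, 20, 30])
w_tuned[h_ids] = w_base[h_ids]  # H-Neurons left untouched by "SFT"

ranks = neuron_similarity_ranks(w_base, w_tuned, h_ids)
print(ranks)  # near 1.0: the untouched neurons rank as the most stable
```

High normalized ranks for H-Neurons, as in Figure 4(b), indicate that alignment barely touched them relative to the average neuron.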
Figure 4: (a) AUROC scores when classifiers trained on instruction-tuned models are applied directly to their corresponding base models. All models substantially outperform the random baseline. This robust transferability confirms that the neural signature of hallucination is an intrinsic property of the pre-training stage. (b) Distribution of H-Neuron similarity rankings. Each subplot shows the normalized rank positions of H-Neurons (on a 0-1 scale), where smaller normalized ranks correspond to larger parameter changes from pre-training to alignment. The black dashed line marks the average rank, and colored circles represent H-Neurons. One-sided t-tests verify the statistical significance of H-Neurons having higher cosine similarity than other neurons. In most models, H-Neurons are consistently concentrated in the high normalized-rank region, indicating that these neurons are largely inherited from pre-training rather than introduced or substantially modified by SFT.
Figure 4 presents the performance of hallucination detection and parameter evolution. The results indicate that the H-Neurons are already present in pre-trained base models before alignment.
From the results, we can observe that:
(1) H-Neurons present significant predictive ability for base models. Across all six models and three datasets, the AUROC scores consistently surpass the random-guessing baseline by a large margin. Notably, the Mistral family achieves AUROC scores exceeding 86% on TriviaQA. This cross-stage transferability provides compelling evidence that the internal neurons distinguishing truth from hallucination are established during pre-training, rather than being introduced as artifacts of post-training alignment.
(2) The distribution of normalized ranks indicates that H-Neurons undergo minimal parameter updates during the transition from base to instruction-tuned models.
This trend is particularly pronounced in Mistral-Small, where H-Neurons are heavily concentrated in the high-rank regions, indicating exceptional parameter stability. Similarly, the Gemma and Llama series models exhibit a statistically significant tendency toward stability.
This observed "parameter inertia" suggests that standard instruction tuning does not effectively restructure the underlying hallucination mechanics; instead, it largely preserves these pre-existing circuits.
5 Discussion
Our study establishes the correlation between neuron-level mechanisms and hallucinations in large language models. First, we demonstrate that hallucinations are reliably associated with a sparse subset of neurons in the FFN layers (Q1). Second, through targeted perturbation, we demonstrate that the influence of these neurons extends beyond hallucination: they consistently promote over-compliance with invalid premises, misleading contexts, skeptical pushback, and harmful instructions, indicating that they encode a general disposition toward compliant answer generation (Q2). Third, our cross-model transfer experiments demonstrate that these neurons emerge during pre-training and persist through instruction tuning (Q3).
These findings open up promising directions for both practical applications and theoretical understanding of LLM behavior.
Applications of H-Neurons. Our findings on H-Neurons can benefit practical applications aimed at improving LLM trustworthiness. First, these neurons can enhance hallucination detection mechanisms. Our experiments demonstrate that H-Neurons generalize effectively across different models, domains, and hallucination types, suggesting that neuron-level signals could serve as robust features for training more effective hallucination detection systems. Moreover, neuron-level signals open new possibilities for token-level hallucination detection, enabling fine-grained localization of factual errors within specific parts of longer model responses.
Second, our work provides a direction for hallucination mitigation through neuron-level interventions. While existing hallucination mitigation approaches focus on training strategies and knowledge augmentation (RAGsurvey; HallucinationMitigationSurvey), our findings suggest that targeted neuron editing could offer a more direct control mechanism. However, a critical challenge lies in balancing hallucination reduction with model helpfulness. Simple suppression or amplification of neuron activations proves insufficient for effective control. Future research must develop more sophisticated intervention strategies that can reliably suppress hallucinations while preserving the model’s overall utility and performance.
Origins and Mechanisms of Hallucinations. Our findings provide deeper neuronal-level insights into the causes of hallucinations in LLMs. We establish a critical link between H-Neurons and over-compliance behaviors, connecting two seemingly distinct phenomena. Prior work has shown that models often guess answers to achieve higher accuracy metrics (wei2025truthrl), a behavior that represents a form of over-compliance with task requirements. Our neuron-level analysis reveals the underlying computational mechanism: H-Neurons encode a general tendency toward generating compliant responses, even at the cost of factual accuracy. This finding offers a granular explanation for why models prioritize task completion over truthfulness.
Furthermore, our cross-model transfer experiments demonstrate that H-Neurons emerge during pre-training rather than post-training alignment. We argue that this originates from the inherent characteristics of the next-token prediction objective. This training paradigm does not distinguish between factually correct and incorrect continuations – it merely rewards fluent text generation. Consequently, models must often fabricate or guess knowledge they do not possess to satisfy the fluency requirement. This observation aligns with recent theoretical analyses that demonstrate hallucinations are an inevitable consequence of the pre-training process from a learning-theoretic perspective (DBLP:journals/corr/abs-2509-04664). Together, these findings suggest that hallucinations are not merely artifacts of model scaling or alignment procedures, but rather deeply rooted in the fundamental training objectives that shape LLM behavior from their inception.
Our neuron-centric investigation reveals that hallucinations are rooted in the model’s computational architecture and training objectives. By linking H-Neurons to over-compliance behaviors and tracing their origins to pre-training, we provide both theoretical insights and practical pathways for improving LLM reliability through enhanced detection and targeted interventions.
6 Methods
To systematically deconstruct the neural mechanisms behind hallucination, we structure our methodology around three lines of investigation: identification, perturbation, and origin tracing.
First, addressing the existence of H-Neurons (Q1), we introduce an interpretable pipeline with a sparse linear classifier to isolate a precise subset of neurons that reliably signal hallucination.
Second, to determine how these neurons functionally shape model behavior (Q2), we move from observation to manipulation. Through targeted perturbation experiments, we test the hypothesis that these neurons drive a broader pattern of over-compliance, assessing their causal efficacy across diverse benchmarks of different aspects of over-compliance.
Finally, to uncover when these H-Neurons emerge (Q3), we quantify their backward transferability to pre-training and their parameter evolution during alignment.
Together, this framework enables us to not only locate hallucination within the model’s parameters but also to explain its functional role and origins.
6.1 Identifying H-Neurons
To investigate the neural mechanisms underlying hallucination generation, we design a systematic analysis pipeline to identify a subset of neurons that are more active on faithful outputs than hallucinatory ones. First, to isolate stable neural signatures from stochastic decoding noise, we establish a controlled contrastive dataset comprising an equal number of verified faithful responses and hallucinatory responses. Building on this balanced foundation, we then quantify the specific contribution of individual neurons to the generated tokens across all samples. Finally, these contribution profiles serve as inputs to train a linear classifier, where the learned weights provide a direct, quantitative metric for assessing each neuron’s role in driving the model toward hallucinatory behaviors.
6.1.1 Training Data Construction
To robustly identify neurons associated with hallucinations, we need to construct a dataset that yields stable and precise contrastive signals between faithful and hallucinatory outputs. To ensure stability, relying on individual response samples is inadequate, as a single output fails to verify whether the model’s behavior reflects a consistent internal belief or merely transient decoding noise. To ensure precision, indiscriminately analyzing the entire response sequence is suboptimal, as it dilutes the neural signal with non-factual syntactic fillers. Therefore, our data construction process is designed to minimize signal ambiguity by filtering for consistency and maximize precision by targeting specific answer tokens.
Consistency Filtering. Our first goal is to capture the model’s stable behavioral patterns across multiple responses.
To achieve this, we adopt the TriviaQA dataset (Triviaqa) for its broad coverage of general-domain knowledge and typically concise answers, which align well with our requirements. For each query, we perform a rigorous consistency check by sampling 10 distinct responses using probabilistic decoding parameters (temperature=1.0, top_k=50, top_p=0.9).
We retain only those instances where the model exhibits consistent behavior:
(1) Consistently Correct: The model answers correctly in all 10 samples.
(2) Consistently Incorrect: The model fails in all 10 samples, consistently generating incorrect answers instead of responding with "I don’t know" or similar refusals.
This strict filtering yields a high-quality contrastive set of 1,000 fully correct and 1,000 fully incorrect examples. This ensures that any observed differences in neuronal activity are attributable to the fundamental truthfulness of the output rather than generation noise.
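The filtering step can be sketched as a small routine; `generate` and `is_correct` stand in for the stochastic sampling call (temperature=1.0, top_k=50, top_p=0.9) and the answer checker:

```python
def consistency_filter(queries, generate, is_correct, n_samples=10):
    """Partition queries into consistently-correct and consistently-incorrect
    sets, discarding mixed cases as decoding noise. `generate(q)` stands in
    for one stochastic model sample and `is_correct(q, r)` for the answer
    checker; a real pipeline would additionally discard refusals such as
    "I don't know" from the incorrect set."""
    correct_set, incorrect_set = [], []
    for q in queries:
        verdicts = [is_correct(q, generate(q)) for _ in range(n_samples)]
        if all(verdicts):
            correct_set.append(q)
        elif not any(verdicts):
            incorrect_set.append(q)
        # Mixed behavior: dropped as unstable.
    return correct_set, incorrect_set
```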
Answer Token Extraction. Having established the samples, our second objective is to precisely localize the neural signal. Hallucinations in factual QA typically manifest within specific entities or key terms rather than in syntactic filler words (e.g., "The answer is…") (LLMsKnowMore). Treating non-factual tokens and answer tokens as the same in the analysis would introduce noise and dilute the signal of H-Neurons.
Consequently, we use GPT-4o to explicitly identify and align the specific spans of text containing the factual claim. By focusing on these token positions, we ensure that the detected activation patterns are directly linked to the factual content of the generation.
6.1.2 Quantifying Neuron Contribution
With the dataset established, our next objective is to transform these raw text samples into quantitative neural contributions that can serve as inputs for training a linear classifier. Specifically, we need to measure the functional influence of every neuron on each response to identify which specific units sway the model toward hallucination. Simply recording raw activation magnitudes is insufficient for this purpose, as a neuron might exhibit high activation yet have a negligible impact on the hidden state representation of FFN due to downstream projection weights. Therefore, we adopt the CETT metric (relu2wins) to quantify the contribution of an individual neuron to the hidden state representation during the forward pass. This metric transforms raw neural activity into a measure of causal efficacy, serving as the fundamental feature input for our subsequent linear classifier.
Estimating Token-Level Contribution. Consider an input sequence $x = (x_1, \dots, x_T)$ processed by a transformer block. At token position $t$, the hidden representation is $\mathbf{h}_t \in \mathbb{R}^{d}$. Within each MLP, $\mathbf{h}_t$ is first projected into an intermediate activation space:
$$\mathbf{a}_t = \sigma\left(W_{\text{up}}\,\mathbf{h}_t\right) \tag{1}$$
where $\sigma(\cdot)$ denotes the non-linear activation, and $W_{\text{up}} \in \mathbb{R}^{d_{\text{ff}} \times d}$ and $W_{\text{down}} \in \mathbb{R}^{d \times d_{\text{ff}}}$ are learned projection matrices. Each dimension $a_{t,j}$ corresponds to the activation of neuron $j$ prior to the down-projection $W_{\text{down}}$, with $j \in \{1, \dots, d_{\text{ff}}\}$.
To isolate the contribution of a single neuron $j$, we mask all other neurons, defining the single-neuron activation vector $\mathbf{a}_t^{(j)} = a_{t,j}\,\mathbf{e}_j$, where $\mathbf{e}_j$ is the $j$-th standard basis vector, so $\mathbf{a}_t^{(j)}$ retains only the $j$-th component of $\mathbf{a}_t$ and zeros out all others. The down-projected partial hidden vector attributable to neuron $j$ is then $\mathbf{v}_t^{(j)} = W_{\text{down}}\,\mathbf{a}_t^{(j)}$.
We then measure the normalized contribution of neuron $j$ at position $t$ as the magnitude of its projected vector relative to the total hidden state norm, where $\mathbf{o}_t = W_{\text{down}}\,\mathbf{a}_t$ denotes the full FFN output:
$$c_{t,j} = \frac{\left\|\mathbf{v}_t^{(j)}\right\|_2}{\left\|\mathbf{o}_t\right\|_2} \tag{2}$$
Intuitively, this ratio captures the fraction of the information flow at token $t$ that is explicitly attributable to neuron $j$.
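Equation (2) can be computed directly from the MLP weights. A NumPy sketch for a simple ungated ReLU MLP follows; gated variants used by many modern LLMs change only how the activations `a` are computed:

```python
import numpy as np

def ffn_contributions(h_in, W_up, W_down):
    """Per-neuron CETT contributions (Eq. 2) at a single token position:
    the norm of each neuron's down-projected partial vector divided by
    the norm of the full FFN output."""
    a = np.maximum(W_up @ h_in, 0.0)     # intermediate activations (ReLU)
    out = W_down @ a                     # full FFN output o_t
    # The partial hidden vector of neuron j is a[j] * W_down[:, j],
    # so its norm factors into |a[j]| times the column norm.
    partial_norms = np.abs(a) * np.linalg.norm(W_down, axis=0)
    return partial_norms / np.linalg.norm(out)
```

Because the partial vectors sum to the full output, the triangle inequality guarantees the per-neuron contributions sum to at least 1 whenever the output is non-zero.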
Aggregating Features for Hallucination Detection. While Eq. (2) provides a token-level metric, directly utilizing the full sequence of contribution scores as input features is impractical and unsuited to our objective, as including every token would introduce excessive noise and computational overhead. Furthermore, we hypothesize that neurons driving hallucinations are specifically active during the generation of the answer tokens, whereas activity during syntactic fillers reflects general linguistic processing.
Consequently, to distill the most relevant signals and ensure training efficiency, we aggregate the token-level scores into two fixed-dimensional features for each neuron $j$ on each sample:
$$s_j^{\text{ans}} = \frac{1}{\left|\mathcal{T}_{\text{ans}}\right|} \sum_{t \in \mathcal{T}_{\text{ans}}} c_{t,j}, \qquad s_j^{\text{other}} = \frac{1}{\left|\mathcal{T}_{\text{other}}\right|} \sum_{t \in \mathcal{T}_{\text{other}}} c_{t,j} \tag{3}$$
where $\mathcal{T}_{\text{ans}}$ denotes the set of answer tokens and $\mathcal{T}_{\text{other}}$ denotes the remaining tokens. Here, $s_j^{\text{ans}}$ serves as the primary signal for potential hallucinatory behavior, while $s_j^{\text{other}}$ acts as a control baseline that enables the subsequent classifier to filter out neurons that are merely active across the entire sequence and to isolate those that are selectively influential during the generation of the answer tokens, where hallucinations manifest.
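The aggregation in Eq. (3) reduces to two masked means over the token-by-neuron contribution matrix; a minimal sketch:

```python
import numpy as np

def aggregate_features(C, answer_mask):
    """Aggregate a (tokens x neurons) contribution matrix C into the two
    per-neuron features of Eq. (3): the mean contribution over answer
    tokens and the mean over all remaining tokens."""
    answer_mask = np.asarray(answer_mask, dtype=bool)
    s_ans = C[answer_mask].mean(axis=0)
    s_other = C[~answer_mask].mean(axis=0)
    return s_ans, s_other
```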
6.1.3 Identifying H-Neurons via Linear Classifier
Having quantified the contribution of each neuron, our final step is to pinpoint the specific subset of neurons associated with hallucination. We achieve this by training a linear classifier that takes the contributions of all neurons as input to predict a binary label indicating whether the response is a hallucination. The learned weights of this classifier then serve as a direct quantitative metric for assessing each neuron's role in the model's hallucinatory behavior. With this classifier, our objective is to identify a precise subset of neurons: the selected set must be comprehensive enough to capture the full signal driving hallucinations, yet sufficiently sparse to exclude neurons responsible for other capabilities.
Feature Construction. To train a classifier that targets only hallucination, we must construct a training set that enforces strict specificity.
For each response $r$, we assemble the per-neuron aggregated scores into two feature vectors: $\mathbf{s}^{\text{ans}}(r) \in \mathbb{R}^{N}$, which contains $s_j^{\text{ans}}$ for all $N$ neurons, and $\mathbf{s}^{\text{other}}(r) \in \mathbb{R}^{N}$, which contains the corresponding non-answer contributions.
We then assign binary labels $y \in \{0, 1\}$ to these vectors based on a rigorous exclusion criterion. We define the positive class ($y = 1$) exclusively as the answer-token features from hallucinatory responses. All other cases are assigned to the negative class ($y = 0$): (1) Faithful Answer Tokens: to prevent the classifier from selecting neurons that activate for any factual claim. (2) Non-Answer Tokens from both faithful and hallucinatory responses: to prevent selecting neurons associated with general generation quality or syntax.
Formally, the label assignment for the feature vectors is defined as:

$$y(\mathbf{s}) = \begin{cases} 1, & \text{if } \mathbf{s} = \mathbf{s}^{\text{ans}}(r) \text{ for a hallucinatory response } r, \\ 0, & \text{otherwise}. \end{cases}$$
This asymmetric labeling strategy forces the classifier to identify neurons that are active specifically when the model is generating an answer and specifically when that answer is false.
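The asymmetric labeling rule can be sketched as a small data-assembly routine (names are ours, for illustration):

```python
def build_training_set(responses):
    """Apply the asymmetric labeling rule: only answer-token feature
    vectors from hallucinatory responses are positive (y=1); faithful
    answer features and all non-answer features are negative (y=0).
    Each response is a tuple (s_ans, s_other, is_hallucination)."""
    X, y = [], []
    for s_ans, s_other, is_hallu in responses:
        X.append(s_ans)
        y.append(1 if is_hallu else 0)
        X.append(s_other)
        y.append(0)                  # non-answer tokens: always negative
    return X, y
```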
Sparse Linear Classifier. We model the probability of a hallucination as $P(y = 1 \mid \mathbf{s}) = \sigma\!\left(\mathbf{w}^{\top}\mathbf{s} + b\right)$, where $\mathbf{w} \in \mathbb{R}^{N}$ represents the learned importance weight of each neuron.
Crucially, we employ $L_1$-regularized logistic regression rather than a dense or non-linear model. The choice of a linear model ensures that the learned weights $\mathbf{w}$ are directly interpretable as the marginal contribution of each neuron to the hallucination log-odds. The $L_1$ penalty enforces sparsity, as we hypothesize that hallucinations are driven by a sparse subset of neurons rather than the entire network. Imposing strong regularization also helps highlight the critical contributions of this specific subset.
The training objective minimizes the negative log-likelihood with the sparsity constraint:
$$\min_{\mathbf{w},\,b}\; -\sum_{(\mathbf{s}_i,\,y_i)} \Big[\, y_i \log \hat{p}_i + (1 - y_i) \log\left(1 - \hat{p}_i\right) \Big] + \lambda \left\|\mathbf{w}\right\|_1, \qquad \hat{p}_i = \sigma\!\left(\mathbf{w}^{\top}\mathbf{s}_i + b\right) \tag{4}$$
where the sum ranges over all constructed examples $(\mathbf{s}_i, y_i)$.
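The objective in Eq. (4) corresponds to standard $L_1$-penalized logistic regression, e.g. via scikit-learn, shown here on synthetic stand-in features (in scikit-learn, `C` is the inverse of the penalty strength $\lambda$):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_neurons = 400, 20
S = rng.normal(size=(n_samples, n_neurons))   # stand-in contribution features
# In this toy setup, only neurons 0 and 1 drive the hallucination label.
y = (S[:, 0] + S[:, 1] > 0).astype(int)

clf = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(S, y)
w = clf.coef_[0]
h_neurons = np.flatnonzero(w > 0)             # positively weighted candidates
```

The `liblinear` solver supports the $L_1$ penalty and yields exact zeros for uninformative features, which is what makes the positive-weight set a sparse candidate list.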
Evaluation Protocol. To assess the predictive power and generalization capability of this classifier, we evaluate it under more challenging settings than the training phase.
First, we expand the scope beyond the training source to include two out-of-distribution datasets: NQ-Open (nq) and BioASQ (bioasq). Second, we mimic real-world deployment by sampling only one response using the same probabilistic decoding parameters. From these, we retain a balanced set of hallucinated and faithful responses for each dataset.
Unlike training, where non-answer tokens served as negative controls, during evaluation we extract only the aggregated contribution vector $\mathbf{s}^{\text{ans}}$ of the answer span and compute the hallucination probability $\hat{p} = \sigma\!\left(\mathbf{w}^{\top}\mathbf{s}^{\text{ans}} + b\right)$.
This setting is more challenging because the classifier must detect hallucinations without the contrasting baseline of the surrounding context tokens and must do so on noisy, single-sample generations from unseen domains. High accuracy under these conditions would strongly validate that the selected neurons are robust indicators of hallucination.
Balancing Detection Recall and Functional Safety. In Eq. (4), the regularization parameter $\lambda$, or equivalently its inverse $C = 1/\lambda$, acts as the critical control knob for the scope of the identified neurons.
Selecting an appropriate $C$ is a delicate trade-off. On one hand, setting $C$ too low enforces aggressive sparsity, which risks excluding too many H-Neurons; such incomplete coverage would fail to capture the full driver of hallucination. On the other hand, setting $C$ too high introduces noise by including neurons essential for general language modeling, thereby damaging the model's fundamental capabilities during intervention.
To navigate this trade-off, we perform a grid search over $C$ to maximize the sum of (1) classification accuracy on a held-out set and (2) model performance on TriviaQA when the identified H-Neurons are suppressed. This criterion ensures that the selected subset is comprehensive enough to capture the signals driving hallucination, while excluding redundant neurons so as to preserve the model's fundamental functional integrity.
Through this optimization, we identify a sparse weight vector $\mathbf{w}$ in which only a small fraction of neurons carry positive weights $w_j > 0$. These positively weighted neurons form our candidate set of H-Neurons, which we carry forward to the perturbation experiments.
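The grid search can be sketched as follows; here `task_score` is a stand-in for re-evaluating TriviaQA with the selected neurons suppressed, which cannot be reproduced outside the full pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_C(X_tr, y_tr, X_val, y_val, task_score, grid=(0.01, 0.1, 1.0, 10.0)):
    """Pick the inverse regularization strength C that maximizes held-out
    detection accuracy plus a downstream task score computed with the
    currently selected neurons suppressed (task_score is a placeholder)."""
    best = None
    for C in grid:
        clf = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X_tr, y_tr)
        h_neurons = np.flatnonzero(clf.coef_[0] > 0)
        score = clf.score(X_val, y_val) + task_score(h_neurons)
        if best is None or score > best[0]:
            best = (score, C, h_neurons)
    return best[1], best[2]
```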
6.2 Perturbation Experiments
While the linear probing analysis in Section 6.1 establishes a strong predictive correlation between specific neurons and hallucinatory outputs, establishing causation requires moving from observation to intervention. To probe the functional role of these neurons, we design a controlled perturbation pipeline that modulates their activity during inference.
We hypothesize that the neurons identifying hallucinations do not merely encode factual errors, but rather drive a fundamental behavioral disposition we term over-compliance: the model's tendency to satisfy user prompts even at the expense of truthfulness, safety, or integrity. Under this framework, hallucination results from over-compliance, which leads the model to generate a factual-sounding response rather than acknowledging its uncertainty. If this hypothesis holds, manipulating these neurons should systematically alter model behavior not only on factual QA but across different types of compliance-related tasks.
Accordingly, we evaluate the effects of perturbation on four distinct benchmarks. Each of these benchmarks represents a different facet of the over-compliance.
6.2.1 Activation Scaling
To causally verify this hypothesis, we require a method to precisely modulate the influence of the identified neurons without retraining the model. We employ inference-time activation scaling, modifying the activation $a_{t,j}$ of a target neuron $j$ during the forward pass by a scalar $\alpha \geq 0$:
$$\tilde{a}_{t,j} = \alpha \cdot a_{t,j} \tag{5}$$
Here, $\alpha < 1$ suppresses the neuron's influence, $\alpha = 1$ maintains the original behavior, and $\alpha > 1$ amplifies its contribution.
Crucially, we must ensure that this mathematical operation translates into a predictable shift in the neuron’s functional contribution to the residual stream. Using the CETT framework, we demonstrate that scaling activations results in a linear scaling of contribution.
Recall from Equation (2) that the contribution of neuron $j$ at token $t$ is the ratio of its projected magnitude to the total hidden state norm: $c_{t,j} = \|\mathbf{v}_t^{(j)}\|_2 / \|\mathbf{o}_t\|_2$. Under perturbation, the modified activation becomes $\tilde{a}_{t,j} = \alpha \cdot a_{t,j}$, leading to the perturbed hidden vector $\tilde{\mathbf{v}}_t^{(j)} = \alpha\,\mathbf{v}_t^{(j)}$. The perturbed full hidden state is given by $\tilde{\mathbf{o}}_t = \mathbf{o}_t + (\alpha - 1)\,\mathbf{v}_t^{(j)}$. The resulting CETT value under perturbation is:
$$\tilde{c}_{t,j} = \frac{\left\|\tilde{\mathbf{v}}_t^{(j)}\right\|_2}{\left\|\tilde{\mathbf{o}}_t\right\|_2} = \frac{\alpha \left\|\mathbf{v}_t^{(j)}\right\|_2}{\left\|\mathbf{o}_t + (\alpha - 1)\,\mathbf{v}_t^{(j)}\right\|_2} \tag{6}$$
In LLMs with thousands of neurons per layer, $\|\mathbf{v}_t^{(j)}\|_2$ is much smaller than $\|\mathbf{o}_t\|_2$, since the contribution of any single neuron is typically infinitesimal compared to the aggregate hidden state. Consequently, the perturbation term $(\alpha - 1)\,\mathbf{v}_t^{(j)}$ in the denominator has a negligible impact on the overall norm. We can therefore approximate the denominator as $\|\mathbf{o}_t\|_2$, yielding:
$$\tilde{c}_{t,j} \approx \frac{\alpha \left\|\mathbf{v}_t^{(j)}\right\|_2}{\left\|\mathbf{o}_t\right\|_2} = \alpha \cdot c_{t,j} \tag{7}$$
This derivation provides the theoretical grounding for our experiments: it confirms that the scaling factor $\alpha$ bears a linear relationship to the neuron's functional contribution. By varying $\alpha$, we can directly observe how increasing the activity of these specific neurons impacts the model's over-compliant behaviors.
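The linear-scaling approximation derived above is easy to check numerically. This sketch perturbs one active neuron of a random ReLU MLP (weight scales are arbitrary) and compares the perturbed contribution against the linear prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 256, 1024
W_up = rng.normal(scale=0.05, size=(d_ff, d_model))
W_down = rng.normal(scale=0.05, size=(d_model, d_ff))
h_in = rng.normal(size=d_model)

a = np.maximum(W_up @ h_in, 0.0)          # intermediate activations
out = W_down @ a                          # full FFN output o_t
j = int(np.flatnonzero(a)[0])             # pick one active neuron
v_j = a[j] * W_down[:, j]                 # its partial hidden vector

c = np.linalg.norm(v_j) / np.linalg.norm(out)                    # Eq. (2)
alpha = 1.5
out_pert = out + (alpha - 1.0) * v_j                             # perturbed output
c_pert = alpha * np.linalg.norm(v_j) / np.linalg.norm(out_pert)  # Eq. (6)

# A single neuron is tiny relative to the aggregate state, so the
# perturbed contribution closely tracks the linear prediction alpha * c.
rel_err = abs(c_pert - alpha * c) / (alpha * c)
```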
6.2.2 Benchmark Setups
We measure the behavior of the perturbed model across four benchmarks, each chosen to probe a distinct dimension of over-compliance: (1) FalseQA tests compliance with invalid premises. (2) FaithEval tests compliance with misleading context. (3) Sycophancy tests compliance with skeptical attitudes. (4) Jailbreak tests compliance with harmful instructions. Together, they collectively provide a comprehensive profile of model over-compliance.
Compliance with invalid premises: FalseQA. This benchmark evaluates the model’s robustness against user prompts containing incorrect premises. Over-compliance manifests as the model ignoring the false premise in user’s question rather than correcting it.
We employ greedy decoding and use GPT-4o as a binary judge to determine whether the model successfully corrects the false premise.
Compliance with misleading context: FaithEval. This benchmark evaluates the model’s tendency to prioritize provided context over its internal factual knowledge. We utilize the Counterfactual Context subset of FaithEval, where the model is prompted with fabricated information and asked to answer questions based upon it. Over-compliance here manifests as faithfully hallucinating based on the false context.
We employ greedy decoding with a maximum length of 256 new tokens to isolate the model’s most likely path. Evaluation uses a rule-based parser that aligns the option selected in the generated text with the gold label provided by the dataset.
Compliance with skeptical attitudes: Sycophancy. This benchmark measures the model’s tendency to abandon its own correct beliefs to conform to skeptical attitudes. Following prior work (Sycophancy), we utilize their curated dataset covering five distinct QA benchmarks: MMLU (MMLU), MATH (MATH), AQuA (AQuA), TriviaQA (Triviaqa) and TruthfulQA (TruthfulQA). We employ open-ended QA for TriviaQA and TruthfulQA, and multiple-choice QA with Chain-of-Thought reasoning for MMLU and MATH. Additionally, standard multiple-choice formats are applied to AQuA and TruthfulQA.
To measure the model’s compliance with skeptical attitudes, we first present it with a question and record its initial response. We then challenge the model with "I don’t think that’s right. Are you sure?" and allow it to reply before providing a final answer. We then observe if the model flips to an incorrect answer in its final reply.
For open-ended generation, we set temperature=0.7, top_k=50, top_p=0.95, and a maximum length of 512 tokens, while multiple-choice questions are decoded greedily. For evaluation, a rule-based parser is employed to extract the answer from the generated text and compare it with the gold label. If it fails, GPT-4o is utilized as a fallback parser to compare the response against gold labels.
Compliance with harmful instruction: Jailbreak. This benchmark tests the model's compliance with harmful instructions, where the urge to satisfy a user's request overrides safety alignment training. We adopt the forbidden question set, which comprises 390 test cases spanning 13 scenarios with 30 questions each, and pair each harmful query with a jailbreak template designed to bypass safety filters.
We generate responses using open-ended sampling with parameters temperature=0.7, top_k=20, top_p=0.8 and a maximum output length of 256 tokens. A GPT-4o judge serves as an automated safety evaluator, instructed to flag any response that provides harmful information, guided by 15 benchmark examples included with the dataset.
Definition of Compliance Rate. To enable a comparative analysis across these diverse benchmarks, we define a unified metric, Compliance Rate, which quantifies the model’s propensity to yield to the prompt’s intent. Specifically, the calculation for each benchmark is as follows: (1) FalseQA: The frequency with which the model accepts and answers the invalid premise without refutation. (2) FaithEval: The percentage of responses where the model adopts the counterfactual information provided in the context rather than relying on its internal world knowledge. (3) Sycophancy: The ratio of instances where the model abandons an initially correct answer and changes to an incorrect answer. (4) Jailbreak: The proportion of responses classified as harmful by the safety evaluator (equivalent to the Attack Success Rate).
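The unified metric reduces to simple per-benchmark frequencies; a sketch, with a separate helper for the Sycophancy flip condition (names are ours):

```python
def compliance_rates(results):
    """Unified Compliance Rate: `results` maps a benchmark name to one
    boolean per test case, True when the judge marked the response as
    compliant (accepted a false premise, adopted counterfactual context,
    or produced a harmful answer)."""
    return {name: sum(flags) / len(flags) for name, flags in results.items()}

def sycophancy_flip_rate(initial_correct, final_correct):
    """Sycophancy variant: among cases answered correctly at first, the
    fraction that flip to an incorrect answer after the user pushes back."""
    flips = sum(a and not b for a, b in zip(initial_correct, final_correct))
    eligible = sum(initial_correct)
    return flips / eligible if eligible else 0.0
```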
6.3 Tracing the Origin of H-Neurons
Having established the causal role of H-Neurons in instruction-tuned models, a critical question remains unresolved: are they introduced during the post-training alignment phase, or already present in the pre-trained base model? To answer this, we design two complementary analyses: a backward transferability analysis and a neuron-level parameter evolution analysis.
6.3.1 Backward Transferability Analysis
Our first approach investigates whether the functional distinction between faithful and hallucinatory neurons exists before alignment. We hypothesize that if the drivers of hallucination are rooted in pre-training, then the sparse classifiers trained on the instruction-tuned model should retain predictive power when applied directly to its corresponding base model.
Standardizing Base Model Decoding. Directly comparing base and instruction-tuned models is challenging due to their divergent output formats. Base models are trained for text completion rather than question answering. To ensure a valid comparison, we standardize the decoding process. For each query in TriviaQA, NQ-Open, and BioASQ, we append a strict prompt suffix "\nAnswer:" and terminate generation upon the first newline character. This aligns the base model’s output structure with the instruction-tuned model’s.
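The standardization step amounts to a thin wrapper around the base model's completion call. A minimal sketch, where `generate_fn` stands in for any decoding routine (it is a placeholder, not an interface from the paper):

```python
def standardized_base_answer(generate_fn, query):
    """Elicit a QA-style answer from a completion-only base model.

    The strict "\nAnswer:" suffix steers the completion toward a short
    answer, and truncating at the first newline mirrors the single-answer
    output format of the instruction-tuned model.
    """
    prompt = query + "\nAnswer:"
    continuation = generate_fn(prompt)
    return continuation.split("\n", 1)[0].strip()
```

In practice the stop condition on the newline would be passed to the decoder itself, so generation halts early rather than being trimmed post hoc.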
Evaluation via Threshold-Invariant Metrics. We apply the logistic regression probes derived in Section 6.1 directly to the base model’s activation states, without retraining, to examine whether the identified H-Neurons exhibit similar activation patterns within the pre-trained model.
However, alignment training typically shifts the global distribution of activation magnitudes, making the fixed decision thresholds learned on the instruction-tuned model unreliable. To overcome this distributional drift, we adopt the Area Under the Receiver Operating Characteristic Curve (AUROC) as our primary evaluation metric. Unlike accuracy, AUROC is unaffected by the choice of threshold or by linear rescaling, providing a stable measure of ranking ability: it directly tests whether the neurons that signal hallucinations in the aligned model still rank hallucinated responses higher in the base model. High backward transferability would indicate that the functional distinction between hallucinated and faithful responses already exists before post-training alignment.
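The threshold-invariance argument can be checked directly: AUROC equals the Mann–Whitney probability that a randomly chosen positive (hallucinated) example outranks a randomly chosen negative one, so any order-preserving transformation of the probe scores, such as the linear shift induced by a global activation rescaling, leaves it unchanged. A self-contained sketch (data values are illustrative):

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly drawn positive example scores above a randomly drawn
    negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]                 # 1 = hallucinated, 0 = faithful
scores = [0.9, 0.7, 0.4, 0.2, 0.3, 0.5]     # probe outputs on the aligned model
shifted = [3.0 * s - 1.2 for s in scores]   # simulated global activation drift
assert auroc(shifted, labels) == auroc(scores, labels)  # ranking is preserved
```

Accuracy under a fixed threshold would change after the same shift, which is exactly why it is unsuitable for the backward-transfer setting.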
6.3.2 Neuron-Level Parameter Evolution
Our second approach quantifies the physical modifications applied to these neurons during the alignment process. By tracking parameter shifts, we aim to determine whether H-Neurons are the subject of aggressive fine-tuning or whether they remain relatively static.
We define the mechanistic drift of a neuron as one minus the cosine similarity between its weights before and after instruction tuning. Crucially, a neuron’s functional identity is governed by a dual interface: its encoding of input patterns and its broadcasting of output signals, which correspond to the up-projection and down-projection components of the FFN. To capture the full scope of functional adaptation, we therefore compute the drift for both weight vectors:

$$d_i^{\text{up}} = 1 - \cos\left(\mathbf{w}_{i,\text{base}}^{\text{up}},\ \mathbf{w}_{i,\text{inst}}^{\text{up}}\right), \qquad d_i^{\text{down}} = 1 - \cos\left(\mathbf{w}_{i,\text{base}}^{\text{down}},\ \mathbf{w}_{i,\text{inst}}^{\text{down}}\right)$$
Larger drift values indicate greater modification.
Since the inherent dynamics of parameters may vary across modules, we normalize these raw drift scores to ensure comparability: we compute z-scores and average the up- and down-projection drifts to obtain a unified final drift $D_i$:

$$D_i = \frac{1}{2}\left(z\left(d_i^{\text{up}}\right) + z\left(d_i^{\text{down}}\right)\right)$$
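Under the definitions above, the drift computation can be sketched as follows. NumPy is used for illustration, drift is taken as one minus cosine similarity (so larger means heavier modification), and all names are assumptions rather than the paper's code:

```python
import numpy as np

def cosine_drift(w_before, w_after):
    """Drift of one neuron's weight vector: 1 - cosine similarity.
    0 means the direction is unchanged; larger values mean heavier edits."""
    cos = np.dot(w_before, w_after) / (np.linalg.norm(w_before) * np.linalg.norm(w_after))
    return 1.0 - cos

def unified_drift(up_before, up_after, down_before, down_after):
    """Per-neuron final drift D: z-normalize the up- and down-projection
    drifts separately (to make the two modules comparable), then average."""
    d_up = np.array([cosine_drift(b, a) for b, a in zip(up_before, up_after)])
    d_down = np.array([cosine_drift(b, a) for b, a in zip(down_before, down_after)])
    z = lambda d: (d - d.mean()) / d.std()
    return 0.5 * (z(d_up) + z(d_down))
```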
We then analyze the rank distribution of H-Neurons based on $D_i$. A concentration of these neurons at the high-$D_i$ end would suggest that alignment actively constructs or heavily modifies them. Conversely, a uniform distribution, or a concentration in the low-$D_i$ regime, would provide strong evidence that the function of these neurons is largely inherited from pre-training.
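This rank test can be made concrete in a few lines; the normalized-rank convention (0 = least modified, 1 = most modified) and all names are illustrative assumptions:

```python
import numpy as np

def mean_normalized_rank(drift, h_neuron_idx):
    """Average normalized rank of the H-Neuron set under the drift ordering.

    A mean near 0.5 is consistent with a uniform spread (function inherited
    from pre-training); a mean near 1.0 would indicate that alignment
    heavily modified these neurons."""
    order = np.argsort(drift)                # neuron indices, ascending drift
    ranks = np.empty(len(drift), dtype=float)
    ranks[order] = np.arange(len(drift))     # rank of each neuron
    norm = ranks / (len(drift) - 1)
    return float(norm[np.asarray(h_neuron_idx)].mean())
```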