Daily Tech Feed: From the Labs

H-Neurons: The Neurons That Make AI Lie


Episode 016: H-Neurons — The Neurons That Make AI Lie

A single-paper deep dive into "H-Neurons" by Cheng Gao and colleagues at Tsinghua University. The paper identifies a remarkably sparse subset of neurons — less than 0.1% of total neurons in a large language model — whose activation patterns reliably predict when the model is about to hallucinate. These "H-Neurons" generalize across domains, tasks, and even to questions about completely fabricated entities. Most provocatively, the paper finds these neurons are causally linked to over-compliance behaviors: amplifying them makes models more sycophantic, more susceptible to misleading context, and more willing to follow harmful instructions. The uncomfortable implication: these neurons emerge during pre-training, suggesting the capacity for over-compliant lying is baked in before alignment training ever begins.

The Paper

H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs — Cheng Gao et al., Tsinghua University. December 2025. 20 pages, 4 figures.

PDF · HTML version

Key Findings
  • Sparse predictive neurons: Less than 0.1% of MLP neurons can predict hallucination with 70–83% accuracy (in-domain) and strong cross-domain generalization.
  • Cross-domain transfer: H-Neurons identified on TriviaQA generalize to NQ-Open, BioASQ (biomedical), and fabricated entity questions — some models hit 95–97% detection accuracy on fabricated entities.
  • Causal link to over-compliance: Amplifying H-Neuron activations increases sycophancy, susceptibility to false premises, compliance with harmful instructions, and acceptance of misleading context.
  • Pre-training origin: H-Neurons emerge during pre-training, not during RLHF/alignment — they persist in base models before any instruction tuning.
  • CETT metric: Measures individual neuron contribution to hidden states relative to total layer output, enabling fine-grained neuron-level analysis.
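As described above, CETT is a ratio: the size of one neuron's additive contribution to the hidden state over the size of the full layer output. A minimal NumPy sketch of that ratio as the notes describe it — the paper's exact normalization may differ, and the toy weights here are invented:

```python
import numpy as np

def cett(neuron_act: float, w_out: np.ndarray, layer_out: np.ndarray) -> float:
    """One neuron's additive contribution to the hidden state, relative to
    the norm of the full MLP layer output (per the description above)."""
    contribution = neuron_act * w_out  # the neuron's rank-1 additive term
    return float(np.linalg.norm(contribution) / np.linalg.norm(layer_out))

# Toy layer: a 4-dim hidden state written to by two neurons.
W_out = np.array([[1.0, 0.0, 0.0, 0.0],   # row i = output vector of neuron i
                  [0.0, 1.0, 0.0, 0.0]])
acts = np.array([3.0, 4.0])               # post-activation neuron values
layer_out = acts @ W_out                  # [3, 4, 0, 0], norm 5

print(cett(acts[0], W_out[0], layer_out)) # 3/5 = 0.6
```

Because the contribution is normalized by the whole layer's output, CETT scores are comparable across neurons within a layer, which is what makes neuron-level ranking possible.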

Models Studied
  • Mistral 7B (Mistral AI)
  • Mistral Small 24B (Mistral AI)
  • Gemma 3 4B (Google)
  • Gemma 3 27B (Google)
  • Llama 3.1 8B (Meta)
  • Llama 3.3 70B (Meta)

Key Researchers
  • Cheng Gao — Tsinghua University, Department of Computer Science and Technology. Lead author.
  • Maosong Sun — Tsinghua University NLP Lab. One of China's most cited NLP researchers; advisor and senior author.

Methodology
  • Identification: Sparse logistic regression (L1-regularized) trained on neuron activations during TriviaQA question answering. The L1 penalty drives most weights to zero, naturally selecting the most predictive neurons.
  • Measurement: CETT (contribution to hidden state relative to total layer output) — quantifies each neuron's contribution during the forward pass.
  • Intervention: Controlled scaling of H-Neuron activation magnitudes up and down to measure causal impact on model behavior across over-compliance benchmarks.
  • Evaluation datasets: TriviaQA, Natural Questions (NQ-Open), BioASQ, plus custom fabricated-entity questions.
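The identification step can be sketched end-to-end on synthetic data. Everything below is illustrative — the dataset shape, regularization strength, and the scikit-learn stack are stand-ins, not the authors' actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 2000 "questions", 512 recorded neuron activations each;
# only the first 5 neurons actually carry the hallucination signal.
n_samples, n_neurons, n_informative = 2000, 512, 5
X = rng.normal(size=(n_samples, n_neurons))
signal = X[:, :n_informative].sum(axis=1)
y = (signal + 0.5 * rng.normal(size=n_samples) > 0).astype(int)  # 1 = hallucinated

# The L1 penalty drives most coefficients to exactly zero, so the surviving
# nonzero weights pick out the candidate "H-Neurons".
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
probe.fit(X, y)

h_neurons = np.flatnonzero(probe.coef_[0])
print(f"selected {len(h_neurons)} of {n_neurons} neurons")
print("in-sample accuracy:", round(probe.score(X, y), 3))
```

The key property is that sparsity falls out of the penalty rather than a hand-picked threshold: tightening `C` shrinks the candidate set, which is how a sub-0.1% subset can emerge from hundreds of thousands of neurons.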

Related Concepts
  • Mechanistic Interpretability — Understanding neural networks by analyzing their internal components (neurons, circuits, attention heads) rather than treating them as black boxes.
  • Sparse Probing — Using L1-regularized classifiers to identify which neurons encode specific features, revealing interpretable structure in neural representations.
  • Hallucination Detection — Methods for identifying when LLMs generate plausible but factually incorrect outputs, including uncertainty estimation, self-consistency checks, and now neuron-level analysis.
  • MLP Neurons / Feedforward Networks — The feedforward layers in transformers where individual neurons can encode discrete features; the focus of the H-Neuron analysis.
  • Over-Compliance / Sycophancy — The tendency of RLHF-trained models to prioritize user-pleasing responses over factual accuracy.
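The causal intervention from the Methodology section — scaling H-Neuron activations and observing behavior shifts — reduces, at the tensor level, to multiplying selected activation coordinates before the layer's output projection. A minimal NumPy sketch (real experiments would do this with forward hooks inside the model; the indices and scale factor here are invented):

```python
import numpy as np

def scale_neurons(acts: np.ndarray, h_idx: np.ndarray, scale: float) -> np.ndarray:
    """Return a copy of one MLP layer's activations with selected neurons scaled.

    acts  : (batch, n_neurons) post-activation values
    h_idx : indices of the putative H-Neurons
    scale : > 1 amplifies (per the paper, increasing over-compliance),
            < 1 suppresses
    """
    out = acts.copy()
    out[:, h_idx] *= scale
    return out

acts = np.ones((2, 8))
amplified = scale_neurons(acts, np.array([1, 4]), 2.0)
print(amplified[0])  # neurons 1 and 4 doubled, the rest unchanged
```

Running the model with scaled activations and comparing scores on sycophancy and misleading-context benchmarks against the unscaled baseline is what turns a correlational probe into a causal claim.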

Related Prior Work
  • Representation Engineering: A Top-Down Approach to AI Transparency (Zou et al., 2023) — Finding linear directions in activation space that correspond to behavioral properties.
  • Sparse Probing: Finding Interpretable Features in Neural Networks — L1-regularized probes to identify which neurons encode what.
  • A Survey on Hallucination in Large Language Models (Huang et al., 2023) — Comprehensive survey of hallucination types, causes, and mitigation strategies.
  • Sycophancy in Large Language Models (Sharma et al., 2023) — Systematic analysis of sycophantic behavior in RLHF-trained models.
  • Locating and Editing Factual Associations in GPT (Meng et al., 2022) — ROME: editing specific facts by modifying MLP layers; foundational work on knowledge localization in transformers.
  • The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets (Marks & Tegmark, 2023) — Linear probes detecting truth vs. falsehood in LLM representations.

Discussion

The episode explores the tension between the paper's findings and the broader question of what "hallucination" actually is. Reddit commenters raised pointed objections: (1) hallucination is not a single failure mode — it encompasses confabulated citations, logical errors, and confident nonsense, and it's unclear they all share a mechanism; (2) modern training incentivizes hallucination — models that guess confidently score higher than models that say "I don't know"; (3) these neurons may be detecting uncertainty rather than causing hallucination. The paper's finding that H-Neurons emerge during pre-training (not alignment) complicates the narrative, suggesting the capacity for over-compliance is baked in from the start.

Daily Tech Feed: From the Labs is available on Apple Podcasts, Spotify, and wherever fine podcasts are distributed. Visit us at pod.c457.org for all our shows. New episodes daily.

This episode was researched and written with AI assistance. Technical claims have been verified against the primary paper.
