LLMs Detect Vulnerabilities by Recognizing Safe Code Patterns, Not Vulnerable Ones, Study Finds
[Submitted on 28 May 2026]
Summary
This research paper uses mechanistic interpretability to analyze how LLMs, specifically Gemma-2-2b, detect software vulnerabilities in C/C++ code. Tracing computational pathways with Circuit Tracer across 472 code samples, the study finds that the model relies primarily on "safety detectors" (attention heads that recognize safe coding patterns) rather than detecting vulnerability signatures directly. When these safety detectors fail to activate, the model classifies the code as vulnerable. Key components include attention heads in early layers (Layers 5 and 7) and MLP neurons in Layer 7. In ablation experiments, removing Layer 11 drops vulnerability detection accuracy from 100% to 6%, and ablating just 20 neurons in Layer 7 reduces it by 50%. The authors conclude that LLM vulnerability detection runs on sparse, interpretable circuits that use only 16% of model capacity.
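To make the "default to vulnerable unless a safety detector fires" idea and the notion of ablation concrete, here is a minimal toy sketch in plain Python. It is not the paper's method, it does not use Gemma-2-2b or Circuit Tracer, and every name in it (`SAFE_PATTERNS`, `classify`, `accuracy`, the sample snippets) is a hypothetical stand-in chosen for illustration. Real attention heads are learned components, not substring matches, and the accuracy figures this toy produces have no relation to the numbers reported in the study.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical "safety detectors": each one fires when it sees a safe idiom.
# In the study these are attention heads; here they are substring checks.
SAFE_PATTERNS: Dict[str, str] = {
    "bounds_check": "sizeof(",
    "bounded_copy": "strncpy(",
    "null_check": "!= NULL",
}


def detector_activations(code: str, ablated: FrozenSet[str]) -> Dict[str, float]:
    """Activation per detector; an ablated detector is forced to 0.0."""
    return {
        name: 0.0 if name in ablated else float(pattern in code)
        for name, pattern in SAFE_PATTERNS.items()
    }


def classify(code: str, ablated: FrozenSet[str] = frozenset()) -> str:
    """Label code 'safe' only if some detector fires; otherwise 'vulnerable'."""
    activations = detector_activations(code, ablated)
    return "safe" if sum(activations.values()) > 0 else "vulnerable"


def accuracy(samples: List[Tuple[str, str]],
             ablated: FrozenSet[str] = frozenset()) -> float:
    """Fraction of samples whose predicted label matches the given label."""
    correct = sum(classify(code, ablated) == label for code, label in samples)
    return correct / len(samples)


# Made-up labeled snippets, for illustration only.
SAMPLES: List[Tuple[str, str]] = [
    ("strncpy(dst, src, n);", "safe"),
    ("if (p != NULL) { use(p); }", "safe"),
    ("memcpy(dst, src, sizeof(dst));", "safe"),
    ("strcpy(dst, src);", "vulnerable"),
]

baseline = accuracy(SAMPLES)
ablated_acc = accuracy(SAMPLES, ablated=frozenset({"bounded_copy", "null_check"}))

print(f"baseline accuracy: {baseline}")    # 1.0
print(f"after ablation:    {ablated_acc}")  # 0.5
```

In this toy, ablation means forcing a detector's output to zero and re-measuring accuracy. With two of the three detectors silenced, the safe snippets that depended on them fall through to the "vulnerable" default, which is the failure mode the summary describes when safety detectors do not activate.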
Key quotes
the model primarily relies on safety detectors, attention heads that recognize safe coding patterns, rather than directly detecting vulnerability signatures
When these safety detectors fail to activate, the model classifies code as vulnerable
removing Layer 11 drops vulnerability detection accuracy from 100% to 6%, while ablating just 20 neurons in Layer 7 reduces it by 50%
LLM vulnerability detection uses sparse, interpretable circuits (only 16% of model capacity)
Our findings show that LLM vulnerability detection uses sparse, interpretable circuits, enabling circuit-level explanations for security predictions and targeted improvements to detection systems