LLMs Detect Vulnerabilities by Recognizing Safe Code Patterns, Not Vulnerable Ones, Study Finds

[Submitted on 28 May 2026]

technology science cybersecurity artificial intelligence

Summary

This research paper uses mechanistic interpretability to analyze how LLMs (specifically Gemma-2-2b) detect software vulnerabilities in C/C++ code. By tracing computational pathways with Circuit Tracer across 472 code samples, the study finds that the model relies primarily on "safety detectors" (attention heads that recognize safe coding patterns) rather than on direct detection of vulnerability signatures. When these safety detectors fail to activate, the model classifies the code as vulnerable. Key components include attention heads in early layers (Layers 5 and 7) and MLP neurons in Layer 7. In ablation experiments, removing Layer 11 drops vulnerability detection accuracy from 100% to 6%, and ablating just 20 neurons in Layer 7 reduces it by 50%. The authors conclude that LLM vulnerability detection runs on sparse, interpretable circuits that use only 16% of model capacity.
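
The decision rule described in the summary (code counts as vulnerable when no safety detector fires) and the idea of ablation can be illustrated with a toy sketch. This is not the paper's method and does not use Gemma-2-2b or Circuit Tracer: the regular expressions are hypothetical stand-ins for attention heads, the four C snippets are made-up samples, and the accuracy figures below belong to this toy only.

```python
import re

# Hypothetical "safety detectors": each one fires on a safe coding pattern.
# In the paper these are attention heads; regexes are only a stand-in here.
SAFETY_DETECTORS = {
    "bounded_copy": re.compile(r"\b(strncpy|snprintf)\s*\("),
    "length_check": re.compile(r"\bif\s*\([^)]*<=?\s*sizeof"),
}

# Made-up labeled samples, for illustration only.
SAMPLES = [
    ("strncpy(dst, src, sizeof(dst) - 1);", "safe"),
    ("if (len < sizeof(buf)) memcpy(buf, src, len);", "safe"),
    ("strcpy(dst, src);", "vulnerable"),
    ("gets(buf);", "vulnerable"),
]


def classify(code, ablated=frozenset()):
    """Return 'safe' if any non-ablated detector fires, else 'vulnerable'."""
    for name, pattern in SAFETY_DETECTORS.items():
        if name in ablated:
            continue  # an ablated detector can never fire
        if pattern.search(code):
            return "safe"
    # No safety detector fired: default to 'vulnerable'.
    return "vulnerable"


def accuracy(ablated=frozenset()):
    """Fraction of SAMPLES classified correctly with some detectors ablated."""
    correct = sum(classify(code, ablated) == label for code, label in SAMPLES)
    return correct / len(SAMPLES)


if __name__ == "__main__":
    print("baseline:", accuracy())
    print("ablate bounded_copy:", accuracy({"bounded_copy"}))
    print("ablate all:", accuracy(set(SAFETY_DETECTORS)))
```

In this toy, disabling a detector makes the safe samples that depended on it fall through to the default "vulnerable" label, so overall accuracy on the mixed sample set drops. That is the general shape of an ablation experiment: remove a component, rerun the evaluation, and attribute the change in accuracy to the removed component.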

Source

LLMs Detect Vulnerabilities by Recognizing Safe Code Patterns, Not Vulnerable Ones, Study Finds (arxiv.org)

Key quotes

the model primarily relies on safety detectors, attention heads that recognize safe coding patterns, rather than directly detecting vulnerability signatures

When these safety detectors fail to activate, the model classifies code as vulnerable

removing Layer 11 drops vulnerability detection accuracy from 100% to 6%, while ablating just 20 neurons in Layer 7 reduces it by 50%

LLM vulnerability detection uses sparse, interpretable circuits (only 16% of model capacity)

Our findings show that LLM vulnerability detection uses sparse, interpretable circuits, enabling circuit-level explanations for security predictions and targeted improvements to detection systems

Snippet from the RSS feed

Large language models (LLMs) can detect software vulnerabilities, but how do they actually identify vulnerable code? We address this question using mechanistic interpretability; analyzing the internal computations of a neural network to understand its rea
