All Topics
All Topics
Technology
Technology
AI
AI
Business
Business
Entertainment
Entertainment
News
News
Programming
Programming
Security
Security
Science
Science
Design
Design
Environment
Environment
Finance
Finance
Crypto
Crypto
Politics
Politics
Sports
Sports
Education
Education
Gaming
Gaming
Art
Art
Music
Music
Health
Health
Books
Books
Food
Food
Travel
Travel
Personal
Personal
Bluesky
Twitter

Prompt Injection Explained as a Role Confusion Problem in LLMs

5d ago· 34 min readenInsight

Summary

This paper presents a theory of prompt injection attacks on LLMs, arguing that the root cause is a fundamental flaw in how models perceive roles — they cannot distinguish between their own thoughts and injected content. The authors demonstrate that LLMs identify roles by writing style rather than explicit tags, and exploit this with a technique called CoT Forgery, where fake reasoning is injected that models mistake for their own thoughts. The work connects prompt injection to mechanistic interpretability results, predicts when attacks succeed, and proposes a new research agenda for a "science of roles" in LLMs.

Source

bskyPrompt Injection Explained as a Role Confusion Problem in LLMsrole-confusion.github.io

Key quotes

· 3 pulled
We show prompt injections are driven by a flaw in how LLMs perceive roles.
LLMs can't tell who's speaking. We show they identify roles by writing style, not tags.
We exploit this with CoT Forgery, injecting fake reasoning that models mistake for their own thoughts.
Snippet from the RSS feed
LLMs can't tell who's speaking. We show they identify roles by writing style, not tags, and exploit this with CoT Forgery, injecting fake reasoning that models mistake for their own thoughts.

You might also wanna read

Comments

Sign in to join the conversation.

No comments yet. Be the first.