Study finds LLMs corrupt documents during delegated editing workflows, with frontier models averaging 25% content degradation
[Submitted on 17 Apr 2026]
Summary
This paper introduces DELEGATE-52, a benchmark to evaluate how well Large Language Models (LLMs) handle delegated document editing tasks across 52 professional domains. Testing 19 LLMs, the study finds that current models systematically degrade documents during long workflows, with even top-tier frontier models corrupting an average of 25% of document content by the end of extended interactions. The research reveals that agentic tool use does not improve performance, and degradation worsens with larger documents, longer interactions, and distractor files. The authors conclude that current LLMs are unreliable delegates that silently introduce sparse but severe errors that compound over time.
Key quotes
Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely.
Current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.
Agentic tool use does not improve performance on DELEGATE-52, and degradation severity is exacerbated by document size, length of interaction, or presence of distractor files.