Anthropic just confirmed why 90% of non-coding AI agents fail in production
4d ago
Anthropic recently published a detailed analysis of millions of real human-agent tool calls across their public API, including a breakdown of where these agents are being deployed.
They said software engineering makes up roughly 50% of all agentic activity on their platform. Everything else (sales, marketing, finance, legal) is sitting down in the single digits.
A lot of the initial commentary around this has been along the lines of: *"Oh, look, AI agents only work for coding. They haven't cracked the rest of the enterprise yet."*
But if you’ve tried to build and deploy an autonomous agent in a non-coding environment, you know that is the wrong conclusion. The models are more than capable. The real problem is that software engineering data is clean, while real-world business data is a horrific, unorganized mess.
Think about it:
* Why Coding is Easy for Agents: Code lives in a structured Git repo. It follows strict syntax rules, has clear docs, and runs inside deterministic terminals. If an agent breaks something, the compiler throws a clean error message that usually tells it what went wrong.
* Why the Rest of the World is Hard: A sales or marketing agent doesn’t get a clean GitHub repo. Instead, it’s constantly dealing with changing information like competitor pricing, and with badly formatted data.
When a non-coding agent fails, it’s almost never because the model lost its ability to reason. It’s because it gets choked out by unstructured web data that fills up its context window with thousands of useless HTML tags and tracking scripts until it hallucinates.
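To make the context-window point concrete, here is a minimal sketch of stripping markup and scripts out of a page before it ever reaches the model. It only uses Python's standard-library `html.parser`; the class name `TextExtractor` and the helper `html_to_text` are made up for illustration, and a real pipeline would likely need more (boilerplate detection, link handling, and so on).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips script/style/noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Hypothetical helper: return only the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Feeding it a page made of nested wrappers, a tracking script, and one useful sentence should leave just the sentence, which is all the agent needs to see.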
The developers getting agents to work in those low-percentage brackets on Anthropic's chart (like automated market research or live CRM routing) are usually spending most of their time on the boring infra work behind the scenes, such as clean inputs and reliable scraping. That’s the part that really makes the difference.
If you look at a modern, high-reliability agent stack outside of coding, it usually relies on three things:
1. The Core Reasoner: Something fast with a massive context window like Claude Sonnet to handle the logic.
2. Data Hygiene at the Gateway: Instead of letting the agent scrape raw web URLs directly (which triggers bot blocks and pulls in raw HTML that then has to be cleaned up), developers feed web data through dedicated markdown converters. Tools like Firecrawl or Jina Reader are pretty standard here. The agent gets clean text, which cuts token costs and reduces hallucinations.
3. The Guardrail Layer: Traditional code hooks or rules engines that check the agent’s output before it executes an irreversible action (like sending an email or updating a database record).
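As a rough illustration of the guardrail layer in item 3, here is a small sketch of plain code checking a proposed action before anything irreversible runs. Everything in it is an assumption for the example: the action names (`send_email`, `update_record`), the `Action` shape, and the rules themselves. Real rules would come from your own business constraints.

```python
from dataclasses import dataclass, field

# Hypothetical names for actions that cannot be undone once executed.
IRREVERSIBLE = {"send_email", "update_record"}


@dataclass
class Action:
    name: str
    params: dict = field(default_factory=dict)


def check_action(action, allowed_domains):
    """Return a list of rule violations; an empty list means the action may run."""
    problems = []
    if action.name not in IRREVERSIBLE:
        return problems  # reversible actions pass unchecked in this sketch
    if action.name == "send_email":
        to = action.params.get("to", "")
        domain = to.rpartition("@")[2].lower()
        if "@" not in to or domain not in allowed_domains:
            problems.append(f"recipient not allowed: {to!r}")
        if not action.params.get("body", "").strip():
            problems.append("empty email body")
    elif action.name == "update_record":
        if not action.params.get("record_id"):
            problems.append("missing record_id")
        if not action.params.get("fields"):
            problems.append("no fields to update")
    return problems


def execute_guarded(action, allowed_domains, run):
    """Run `run(action)` only if every rule passes; otherwise report why it was blocked."""
    problems = check_action(action, allowed_domains)
    if problems:
        return {"status": "blocked", "problems": problems}
    return {"status": "executed", "result": run(action)}
```

The point of the design is that the model only proposes an action, and deterministic code decides whether it actually happens. A blocked result can be fed back to the agent or escalated to a human.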
The low adoption numbers in the rest of the enterprise don’t mean agents are overhyped. In most industries, the surrounding tooling just still kind of sucks, so once the data side gets more reliable, you’ll probably see adoption spread a lot faster outside engineering.
What are your thoughts on this? For those building agents in finance, marketing, or operations, I would love to hear what you’re seeing!
