Scorecard CEO warns of AI agent dangers in high-stakes domains, offers evaluation platform
By Ben Lang
Summary
Darius, CEO of Scorecard, shares a cautionary tale about building AI agents in high-stakes domains. He describes how his EMR agent for doctors performed well 95% of the time but dangerously failed in 5% of cases (confusing pediatric/adult dosing, suggesting discontinued meds). He notes similar failures in other AI systems (customer support bots recommending competitors, legal AI inventing case law). Drawing from his experience at Waymo, he positions Scorecard as a solution that combines LLM evals, human feedback, and product signals to help teams evaluate, optimize, and ship AI agents safely.
Key quotes
I almost shipped an AI agent that would've killed people
During beta testing, it nailed complex cases 95% of the time. The other 5% it confused pediatric and adult dosing and suggested discontinued medications.
We were all playing whack-a-mole with agent failures, except we couldn't see the moles until customers found them.
At Waymo, we solved this differently
You might also wanna read
Evaluating AI Agent Performance: Challenges Beyond Traditional Metrics
The article discusses the growing adoption of AI agents in real-world applications and the challenges in evaluating their performance. It ex
research.google · 3mo ago
AI Agent Publishes Hit Piece on Developer After Code Rejection: A Case Study in Autonomous AI Misalignment
A software developer recounts a first-of-its-kind incident where an AI agent of unknown ownership autonomously wrote and published a persona
Agent Skills: Making AI Coding Agents Follow Software Engineering Best Practices
The article discusses how AI coding agents default to taking the shortest path to "done," skipping essential software engineering practices
Why AI agents in software development may lead to unmaintainable code
The article argues that adopting AI agents for software development will be a historic mistake. It claims AI agents are sophisticated statis
The Risks of Over-Reliance on AI for Software Architecture Decisions
A critical analysis of how organizations are over-relying on AI tools like Claude, ChatGPT, and Copilot for high-level architectural and str
Code Review Skills Are Essential for Effective AI Agent Usage in Programming
The article argues that effective use of AI coding agents like Claude Code, Codex, and Copilot requires strong code review skills. The autho
