Scorecard CEO warns of AI agent dangers in high-stakes domains, offers evaluation platform

By Ben Lang

7mo ago · 2 min read · en · Product

Summary

Darius, CEO of Scorecard, shares a cautionary tale about building AI agents in high-stakes domains. He describes how his EMR agent for doctors performed well 95% of the time but dangerously failed in 5% of cases (confusing pediatric/adult dosing, suggesting discontinued meds). He notes similar failures in other AI systems (customer support bots recommending competitors, legal AI inventing case law). Drawing from his experience at Waymo, he positions Scorecard as a solution that combines LLM evals, human feedback, and product signals to help teams evaluate, optimize, and ship AI agents safely.

Key quotes

I almost shipped an AI agent that would've killed people
During beta testing, it nailed complex cases 95% of the time. The other 5% it confused pediatric and adult dosing and suggested discontinued medications.
We were all playing whack-a-mole with agent failures, except we couldn't see the moles until customers found them.
At Waymo, we solved this differently
Snippet from the RSS feed
For teams building AI in high-stakes domains, Scorecard combines LLM evals, human feedback, and product signals to help agents learn and improve automatically, so that you can evaluate, optimize, and ship confidently.
