Study Finds 67% Disagreement Rate Among Top AI Models on Real-World Fact-Checks
By Kosta Jordanov
Summary
A study by Lenz Research tested five frontier LLMs on 1,000 real-world claims that users had submitted to a fact-checking platform, asking each model for a verdict. The models disagreed on the verdict for 67% of the claims. Because these were real user submissions rather than benchmark items with public answer keys, the result points to substantial disagreement among leading AI systems in practical fact-checking.
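To make the headline figure concrete, here is a minimal sketch of how a disagreement rate of this kind could be computed. The summary does not state the study's exact definition, so this assumes a claim counts as a disagreement whenever the five verdicts are not unanimous. The function name, the verdict labels, and the toy data are all hypothetical and are not taken from the study.

```python
from typing import Sequence


def disagreement_rate(verdicts_per_claim: Sequence[Sequence[str]]) -> float:
    """Return the fraction of claims whose model verdicts are not unanimous.

    Assumption: "disagreement" means at least two models gave different
    verdicts. The study may use a different definition.
    """
    if not verdicts_per_claim:
        raise ValueError("need at least one claim")
    # A claim is split when its verdicts contain more than one distinct label.
    split = sum(1 for verdicts in verdicts_per_claim if len(set(verdicts)) > 1)
    return split / len(verdicts_per_claim)


# Hypothetical toy data: 3 claims, 5 model verdicts each (not study data).
sample = [
    ["true", "true", "true", "true", "true"],      # unanimous
    ["true", "false", "true", "mixed", "true"],    # split
    ["false", "false", "true", "false", "false"],  # split
]

rate = disagreement_rate(sample)
print(f"{rate:.0%}")  # prints "67%" for this toy data (2 of 3 claims split)
```

Under this definition a single dissenting model is enough to mark a claim as disputed, so the rate measures the absence of consensus rather than how often any one model is wrong.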
Key quotes
67% of real fact-checks, top AI models don't agree on the answer.
We presented 1,000 recent real user claims to the five top frontier LLMs and asked each one for a verdict.
These aren't benchmark items with public answer keys — they're claims real users submitted for verification to a fact-checking platform.