Study Finds 67% Disagreement Rate Among Top AI Models on Real-World Fact-Checks

Kosta Jordanov

4d ago · 17 min read · en · Insight

Summary

Lenz Research tested five frontier LLMs on 1,000 real-world claims that users had submitted to a fact-checking platform for verification. On 67% of those claims, the models did not agree on the verdict. Because the claims came from real users rather than from a benchmark with a public answer key, the result points to substantial disagreement among leading AI systems in practical fact-checking.
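
One plausible reading of the headline figure is the share of claims on which the five verdicts were not unanimous. The sketch below assumes that definition, which the summary does not confirm. The verdict labels and the toy data are invented for illustration and are not taken from the study's CSV, and `disagreement_rate` is a hypothetical helper name.

```python
def disagreement_rate(verdicts_by_claim):
    """Return the fraction of claims whose model verdicts are not unanimous.

    Assumption: a claim counts as a disagreement when at least two models
    return different verdict labels. The study may define it differently.
    """
    if not verdicts_by_claim:
        raise ValueError("need at least one claim")
    disagreements = sum(
        1 for verdicts in verdicts_by_claim if len(set(verdicts)) > 1
    )
    return disagreements / len(verdicts_by_claim)


# Toy data (invented): four claims, five model verdicts per claim.
toy_verdicts = [
    ["true", "true", "true", "true", "true"],        # unanimous
    ["true", "false", "true", "mixed", "true"],      # disagreement
    ["false", "false", "mixed", "false", "false"],   # disagreement
    ["false", "false", "false", "false", "false"],   # unanimous
]

rate = disagreement_rate(toy_verdicts)
print(f"{rate:.0%} of toy claims show disagreement")
```

Under this definition a single dissenting model is enough to count a claim as disputed, so the rate can be high even when four of the five models agree on most claims.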

Key quotes

67% of real fact-checks, top AI models don't agree on the answer.

We presented 1,000 recent real user claims to the five top frontier LLMs and asked each one for a verdict.

These aren't benchmark items with public answer keys — they're claims real users submitted for verification to a fact-checking platform.

Snippet from the RSS feed

67% of real-world fact-checks expose disagreement among the five top frontier AI models. Methodology, data, and the full CSV.

You might also wanna read

Study Finds Frontier AI Models Disagree on Two-Thirds of Basic Fact-Check Claims

A new study by researcher Kosta Jordanov at Lenz Research tested five frontier AI models (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 P

decrypt.co·2d ago

A Professional Fact-Checker Explains Why AI Is Unreliable for Accurate Information

A professional fact-checker at WIRED examines the reliability of AI chatbots for factual information, arguing that AI models frequently prod

wired.com·3d ago

Study finds LLMs persist in treating false claims as true despite explicit warnings

A study on fine-tuning large language models (LLMs) reveals that even after explicit warnings that certain claims are false, the models cont

arstechnica.com·1d ago

Major AI models fail EU legal compliance tests, Aithos study finds

Nonprofit AI research foundation Aithos developed a tool called LARA (Legal Assessment for Real-world Agents) to evaluate AI models' complia

theregister.com·4d ago

Study: Major AI systems from Google, OpenAI, and Anthropic frequently violate EU law in controlled tests

A study from Amsterdam-based AI institute Aithos tested 12 AI models (including systems from Google, OpenAI, and Anthropic) across roughly 1

dlvr.it·2d ago

Analysis Finds Google's AI Overviews Serve Misinformation at Massive Scale

A new analysis commissioned by The New York Times and conducted by AI startup Oumi found that Google's AI Overviews are accurate only about

futurism.com·5h ago