Study Finds Frontier AI Models Disagree on Two-Thirds of Basic Fact-Check Claims
By Jose Antonio Lanz
Summary
A new study by researcher Kosta Jordanov at Lenz Research tested five frontier AI models (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, and Sonar Pro) on 1,000 real-world fact-check claims. On 672 of the 1,000 claims (about 67%), at least one model broke from the majority verdict. In 34% of cases, the disagreement was significant. The study points to fundamental reliability problems when AI systems are used for basic factual verification.
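To make the headline metric concrete, here is a minimal Python sketch of how a "at least one model broke from the majority" rate could be computed. The function names, the verdict labels, and the toy data are all hypothetical; the study's actual labels, tie handling, and scoring method are not described in this summary, so this illustrates the arithmetic only.

```python
from collections import Counter

def breaks_from_majority(verdicts):
    """Return True if at least one verdict differs from the most common one.

    Assumption: each claim has one verdict label per model, and the
    "majority" is simply the most frequent label for that claim.
    """
    most_common_verdict, _ = Counter(verdicts).most_common(1)[0]
    return any(v != most_common_verdict for v in verdicts)

def disagreement_rate(claims):
    """Share of claims on which at least one model broke from the majority."""
    flagged = sum(breaks_from_majority(v) for v in claims)
    return flagged / len(claims)

# Hypothetical toy data: five model verdicts per claim (not from the study).
toy_claims = [
    ["true", "true", "true", "true", "true"],              # unanimous
    ["true", "true", "false", "true", "true"],             # one dissenter
    ["false", "false", "true", "unverifiable", "false"],   # two dissenters
]

print(f"{disagreement_rate(toy_claims):.0%}")  # 2 of 3 toy claims flagged
print(f"{672 / 1000:.0%}")                     # the study's reported count as a percentage
```

Under this reading, a claim counts toward the rate whenever the five verdicts are not unanimous, which is why 672 flagged claims out of 1,000 rounds to the 67% in the headline.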
Key quotes
Ask five of the world's most advanced AI systems whether a statement is true, and two-thirds of the time, at least one will give you a different answer.
On 672 out of 1,000 claims, at least one model broke from the majority.
In 34% of cases, the disagreement was significant.
