Sup AI: Ensemble System Drawing on a Pool of 339 LLMs to Reduce Hallucinations Scores 52.15% on Humanity's Last Exam
By Ken Mueller
Summary
Sup AI is an AI ensemble system that runs multiple large language models, drawn from a pool of 339, in parallel to reduce hallucinations. It measures confidence on every segment of output, downweighting high-entropy (likely hallucinated) segments and amplifying low-entropy (likely accurate) ones before synthesizing a final answer. The system scored 52.15% on Humanity's Last Exam, 7.41 percentage points ahead of any individual model. The article also promotes the product with a $10 starter credit offer.
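The entropy-based weighting described above can be illustrated with a small sketch. This is not Sup AI's implementation, which the article does not describe in detail. It assumes each model exposes a per-token probability distribution for a candidate segment, that a segment's confidence is the mean Shannon entropy of those distributions, and that weights decay exponentially with entropy. The function names (`token_entropy`, `segment_entropy`, `entropy_weight`, `pick_segment`) and the `temperature` parameter are all hypothetical.

```python
import math
from collections import defaultdict


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def segment_entropy(token_dists):
    """Mean per-token entropy across a segment (assumed confidence measure)."""
    if not token_dists:
        raise ValueError("segment has no token distributions")
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


def entropy_weight(entropy, temperature=1.0):
    """Hypothetical weighting: low entropy -> weight near 1, high entropy -> near 0."""
    return math.exp(-entropy / temperature)


def pick_segment(candidates, temperature=1.0):
    """Choose the candidate text with the highest total entropy-based weight.

    candidates: list of (text, token_dists) pairs, one per model.
    Models that agree on the same text pool their weights.
    """
    scores = defaultdict(float)
    for text, token_dists in candidates:
        scores[text] += entropy_weight(segment_entropy(token_dists), temperature)
    return max(scores, key=scores.get)


# Three hypothetical models answer the same segment. Two are confident
# (peaked distributions) and agree; one is uncertain (flat distribution).
candidates = [
    ("Paris", [[0.97, 0.02, 0.01]]),
    ("Lyon", [[0.40, 0.35, 0.25]]),
    ("Paris", [[0.90, 0.05, 0.05]]),
]
best = pick_segment(candidates)
```

In this toy example the two confident, agreeing answers outweigh the uncertain one, so `best` is `"Paris"`. A real system would also need to align segments across models and handle models that do not expose token probabilities, neither of which is attempted here.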
Key quotes
Every LLM hallucinates. They just don't hallucinate the same things.
Sup AI runs multiple LLMs (out of 339) in parallel, then synthesizes answers by measuring confidence on every segment.
High entropy = likely hallucination, downweighted. Low entropy = likely accurate, amplified.
Result: 52.15% on Humanity's Last Exam, 7.41 points ahead of any individual model.
$10 starter credit. Card verified. No auto-charge.