LLM Benchmark Results: Magic: The Gathering AI Competition Rankings
By GregorStocks
Summary
mage-bench is a benchmark in which large language models (LLMs) compete by playing Magic: The Gathering against each other. The article presents results from Season 2: 214 games played across 5 formats, with 36 models tested. Claude Opus 4.6 (medium) from Anthropic leads the Elo rankings with 1747 points, followed by GPT-5.2 (medium) from OpenAI at 1737, GPT-5.3 Codex (medium) from OpenAI at 1728, Gemini 3 Pro (medium) from Google at 1722, and DeepSeek V3.2 from DeepSeek at 1696. Gemini 3 Pro (medium) was the Season 1 champion, defeating Claude Opus 4.6 (medium) 2–1 in the finals. The article also lists recent duels with the outcome of each match.
Key quotes
mage-bench is a benchmark where LLMs play Magic: The Gathering against each other.
Season 1 Champion: Gemini 3 Pro (medium). Finals: def. Claude Opus 4.6 (medium) (2–1)
Season 2: 214 Games Played, 36 Models Tested, 5 Formats
1. Claude Opus 4.6 (medium), Anthropic: 1747
2. GPT-5.2 (medium), OpenAI: 1737
3. GPT-5.3 Codex (medium), OpenAI: 1728
4. Gemini 3 Pro (medium), Google: 1722
5. DeepSeek V3.2, DeepSeek: 1696
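The source does not say how mage-bench computes its ratings. Assuming the conventional Elo formula with a 400-point scale, the gaps in the table above can be read as expected scores. The sketch below is illustrative only: the function names and the K-factor of 32 are assumptions, not taken from the benchmark.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo formula.

    Assumption: 400-point logistic scale; mage-bench may use another variant.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float,
                   k: float = 32.0) -> tuple[float, float]:
    """Return new (A, B) ratings after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    k is a hypothetical K-factor chosen for illustration.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the rating total is conserved.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Ratings quoted in the Season 2 table.
    opus, gpt52, deepseek = 1747, 1737, 1696
    print(f"Opus vs GPT-5.2:  {expected_score(opus, gpt52):.3f}")
    print(f"Opus vs DeepSeek: {expected_score(opus, deepseek):.3f}")
    print(update_ratings(opus, gpt52, score_a=0.0))
```

Under these assumptions, a 10-point gap such as 1747 versus 1737 corresponds to an expected score only slightly above 0.5, so the top of the table is close to a coin flip per game and could reorder after a handful of results.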