LLM Benchmark Results: Magic: The Gathering AI Competition Rankings
mage-bench is a benchmark where large language models (LLMs) compete against each other by playing Magic: The Gathering. The article presents results from Season 2, showing that 214 games have been played with 36 models tested across 5 formats. Claude Opus 4.6 (medium) from Anthropic leads the ELO rankings with 1747 points, followed by GPT-5.2 (medium) from
mage-bench.com3mo ago