
LLM Benchmark Results: Magic: The Gathering AI Competition Rankings

By GregorStocks

3mo ago · 2 min read · en · News

Summary

mage-bench is a benchmark in which large language models (LLMs) compete by playing Magic: The Gathering against each other. The article reports Season 2 results: 214 games played and 36 models tested across 5 formats. Claude Opus 4.6 (medium) from Anthropic leads the Elo rankings at 1747, followed by GPT-5.2 (medium) from OpenAI at 1737, GPT-5.3 Codex (medium) from OpenAI at 1728, Gemini 3 Pro (medium) from Google at 1722, and DeepSeek V3.2 from DeepSeek at 1696. Gemini 3 Pro (medium) was the Season 1 champion, defeating Claude Opus 4.6 (medium) 2–1 in the finals. The article also lists the outcomes of recent duels between individual models.
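
To give a sense of how close the top ratings are, here is a minimal sketch using the standard two-player Elo formulas. This is an assumption for illustration only: the article does not say how mage-bench computes its ratings, which K-factor it uses, or how multiplayer formats are scored. The function names and the K value of 32 are hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, not a value taken from mage-bench.
    """
    expected_a = elo_expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Ratings quoted in the summary above.
opus, gpt52, deepseek = 1747, 1737, 1696

p_first_vs_second = elo_expected_score(opus, gpt52)
p_first_vs_fifth = elo_expected_score(opus, deepseek)

print(f"Expected score, 1st vs 2nd: {p_first_vs_second:.3f}")
print(f"Expected score, 1st vs 5th: {p_first_vs_fifth:.3f}")
```

Under these assumptions, a 10-point gap corresponds to an expected score only slightly above one half, so the ordering among the top few models could plausibly change with a handful of additional games.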

Key quotes

mage-bench is a benchmark where LLMs play Magic: The Gathering against each other.
Season 1 Champion: Gemini 3 Pro (medium). Finals: def. Claude Opus 4.6 (medium) (2–1)
Season 2: 214 Games Played, 36 Models Tested, 5 Formats
1. Claude Opus 4.6 (medium), Anthropic: 1747
2. GPT-5.2 (medium), OpenAI: 1737
3. GPT-5.3 Codex (medium), OpenAI: 1728
4. Gemini 3 Pro (medium), Google: 1722
5. DeepSeek V3.2, DeepSeek: 1696
