MTG Bench: A benchmark evaluating LLM performance in playing Magic: The Gathering
By CallumFerg
Summary
MTG Bench is a benchmark that evaluates how well Large Language Models (LLMs) play Magic: The Gathering. The article reports results from testing several LLMs (including Fable 5, Gemini 3.5 Flash, Opus 4.8, and GPT 5.5) on their ability to understand and execute complex game mechanics such as scrying, discovering, and tutoring. It highlights both successes (e.g., Gemini 3.5 Flash handling a complex turn) and failures (e.g., Opus 4.8 erroneously returning a card to the deck, GPT 5.5 forgetting to return cards exiled with discover to the deck). The benchmark evaluates models on strategic gameplay, rule adherence, and tool use within the game environment.
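The discover failure described above (exiled cards never returned to the deck) is the kind of rule-adherence error that can be caught mechanically by comparing zone contents before and after an effect resolves. The following is a minimal sketch of such a check, not MTG Bench's actual evaluation code: the `GameState` class, its field names, and the `check_discover` function are all hypothetical and introduced only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class GameState:
    """Hypothetical snapshot of one player's zones (not MTG Bench's real schema)."""
    library: list[str] = field(default_factory=list)
    hand: list[str] = field(default_factory=list)
    battlefield: list[str] = field(default_factory=list)
    graveyard: list[str] = field(default_factory=list)
    exile: list[str] = field(default_factory=list)
    stack: list[str] = field(default_factory=list)

    def all_cards(self) -> Counter:
        """Multiset of every card across all zones."""
        return Counter(
            self.library + self.hand + self.battlefield
            + self.graveyard + self.exile + self.stack
        )


def check_discover(before: GameState, after: GameState, revealed: list[str]) -> list[str]:
    """Return human-readable violations found after a discover effect resolves.

    `revealed` lists the cards exiled from the top of the library by discover.
    Once the effect has resolved, none of them should still be in exile.
    """
    problems = []

    # Invariant 1: no card may appear or vanish while the effect resolves.
    if before.all_cards() != after.all_cards():
        problems.append("card multiset changed between snapshots")

    # Invariant 2: cards newly in exile that were revealed by discover are stranded.
    newly_exiled = Counter(after.exile) - Counter(before.exile)
    stranded = newly_exiled & Counter(revealed)
    if stranded:
        names = ", ".join(sorted(stranded.elements()))
        problems.append(f"revealed cards left in exile: {names}")

    return problems
```

A harness could run a check like this after every tool call, so that a mistake is flagged even when the model does not report it itself.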
Key quotes
Gemini 3.5 Flash performs complex turn with scry, discover, and tutor effects
Opus 4.8 erroneously returns a card to the deck, then self-reports the mistake
GPT 5.5 forgets to return cards exiled with discover to the deck and self-reports the mistake
Fable 5 makes a tool mistake, then silently tries to restart the turn (caught by evaluation later)
The main idea is that if an LLM is
