MTG Bench: A benchmark evaluating LLM performance in playing Magic: The Gathering

CallumFerg

12h ago · 7 min read

Summary

MTG Bench is a benchmark that evaluates how well large language models (LLMs) play Magic: The Gathering. The article presents results from testing several LLMs (including Fable 5, Gemini 3.5 Flash, Opus 4.8, and GPT 5.5) on their ability to understand and execute complex game mechanics such as scrying, discovering, and tutoring. It highlights both successes (e.g., Gemini 3.5 Flash handling a complex turn) and failures (e.g., Opus 4.8 erroneously returning a card to the deck, GPT 5.5 forgetting to return cards exiled with discover to the deck). Models are assessed on strategic gameplay, rule adherence, and tool use within the game environment.
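
To make the discover failure concrete, here is a minimal Python sketch of how "discover N" resolves under the standard rules: exile cards from the top of the library until a nonland card with mana value N or less is exiled, then put the remaining exiled cards on the bottom in a random order. The names `Card`, `discover`, and `cards_conserved` are hypothetical and written for illustration only; nothing here is taken from MTG Bench's actual harness or tools.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    # Hypothetical minimal card model, for illustration only.
    name: str
    mana_value: int
    is_land: bool = False


def discover(library, n, rng=None):
    """Resolve 'discover n' on a library where index 0 is the top card.

    Returns (hit, new_library). `hit` is the first nonland card with
    mana value <= n, or None if the library runs out first.
    """
    rng = rng or random.Random()
    remaining = list(library)
    exiled = []
    hit = None
    while remaining:
        card = remaining.pop(0)
        if not card.is_land and card.mana_value <= n:
            hit = card
            break
        exiled.append(card)
    # The step the summary says one model forgot: the other exiled
    # cards go back on the bottom of the library in a random order.
    rng.shuffle(exiled)
    return hit, remaining + exiled


def cards_conserved(before, after, hit):
    """Check that no card was lost or duplicated while discovering."""
    accounted = Counter(after)
    if hit is not None:
        accounted[hit] += 1
    return accounted == Counter(before)


library = [
    Card("Forest", 0, is_land=True),
    Card("Shivan Dragon", 6),
    Card("Lightning Bolt", 1),
    Card("Island", 0, is_land=True),
]
hit, new_library = discover(library, 3, rng=random.Random(0))
```

A conservation check like `cards_conserved` is one way an evaluator could notice that exiled cards were never returned to the deck; whether MTG Bench detects the mistake this way is an assumption.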

Key quotes


Gemini 3.5 Flash performs a complex turn with scry, discover, and tutor effects

Opus 4.8 erroneously returns a card to the deck, then self-reports the mistake

GPT 5.5 forgets to return cards exiled with discover to the deck and self-reports the mistake

Fable 5 makes a tool mistake, then silently tries to restart the turn (caught by evaluation later)

The main idea is that if an LLM is

Snippet from the RSS feed

MTG Bench tests how well LLMs can play Magic.
