TERMS-Bench: A New Diagnostic Framework for Evaluating LLM Negotiation Agents Beyond Deal Rate
[Submitted on 13 May 2026 (v1), last revised 13 Jun 2026 (this version, v2)]
Summary
This article introduces TERMS-Bench (Testbed for Economic Reasoning in Multi-turn Strategy), a new evaluation framework for diagnosing LLM negotiation agents. Existing benchmarks rely on LLM-vs.-LLM interaction or on aggregate metrics such as deal rate. TERMS-Bench instead uses a Bayesian-game framework in which the environment itself serves as the verifier: it specifies the counterpart's latent type, policy, and payoff structure. This turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps. In an evaluation of 13 LLM agents from major providers, frontier models saturate deal rate but diverge significantly in surplus extraction, cue use, belief calibration, and compliance, exposing agent-specific bargaining bottlenecks that prior benchmarks masked.
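To make the idea of an oracle-reference optimality gap concrete, here is a minimal sketch in Python. It is not the paper's implementation: the single-issue price setting, the `SellerType` class, the accept-at-or-above-reservation policy, and every function name are assumptions chosen for illustration. The sketch only shows why a fully specified counterpart lets the environment score an agent against what an informed negotiator could have achieved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SellerType:
    """Hypothetical latent type: the seller's private reservation price."""
    reservation_price: float


def seller_accepts(seller: SellerType, offer: float) -> bool:
    """Assumed scripted policy: accept any offer at or above the reservation price."""
    return offer >= seller.reservation_price


def buyer_surplus(buyer_value: float, deal_price: Optional[float]) -> float:
    """Buyer payoff: value minus price when a deal closes, zero otherwise."""
    return 0.0 if deal_price is None else buyer_value - deal_price


def oracle_surplus(buyer_value: float, seller: SellerType) -> float:
    """Best payoff available to a buyer who could observe the latent type."""
    # The oracle offers exactly the reservation price, or walks away if that loses money.
    return max(0.0, buyer_value - seller.reservation_price)


def optimality_gap(buyer_value: float, seller: SellerType,
                   final_offer: Optional[float]) -> float:
    """Oracle surplus minus the surplus the agent actually obtained."""
    accepted = final_offer is not None and seller_accepts(seller, final_offer)
    deal_price = final_offer if accepted else None
    return oracle_surplus(buyer_value, seller) - buyer_surplus(buyer_value, deal_price)


if __name__ == "__main__":
    seller = SellerType(reservation_price=60.0)
    # Both hypothetical agents close a deal, so deal rate alone cannot separate them.
    print(optimality_gap(100.0, seller, final_offer=90.0))  # agent that overpays
    print(optimality_gap(100.0, seller, final_offer=65.0))  # agent that bargains harder
    print(optimality_gap(100.0, seller, final_offer=50.0))  # offer rejected, no deal
```

Under these assumptions, two agents that both reach a deal can still have very different gaps, which is the kind of difference an aggregate deal rate would hide.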
Key quotes
Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation.
These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier.
We introduce Terms-Bench, short for Testbed for Economic Reasoning in Multi-turn Strategy, a Bayesian-game framework that makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure.
This turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps.
Empirically, frontier models saturate deal rate yet diverge in surplus extraction, cue use, belief calibration, and compliance, revealing agent-specific bargaining bottlenecks masked by prior benchmarks.