TERMS-Bench: A New Diagnostic Framework for Evaluating LLM Negotiation Agents Beyond Deal Rate

[Submitted on 13 May 2026 (v1), last revised 13 Jun 2026 (this version, v2)]

Summary

This article introduces TERMS-Bench (Testbed for Economic Reasoning in Multi-turn Strategy), an evaluation framework for diagnosing LLM negotiation agents. Where existing benchmarks rely on LLM-vs.-LLM interaction or on aggregate metrics such as deal rate, TERMS-Bench uses a Bayesian-game framework in which the environment itself serves as the verifier: it specifies the counterpart's latent type, policy, and payoff structure. This turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps. In an evaluation of 13 LLM agents from major providers, frontier models saturate deal rate but diverge in surplus extraction, cue use, belief calibration, and compliance, exposing agent-specific bargaining bottlenecks that prior benchmarks masked.
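To make the idea of an oracle-reference optimality gap concrete, here is a minimal sketch of a one-shot price negotiation with a fully specified counterpart. It is not the benchmark's actual formalism or code: the `Counterpart` class, the threshold acceptance policy, the payoff definitions, and all numbers are hypothetical assumptions chosen for illustration. It shows how two agents can both close the deal (identical deal rate) while leaving very different amounts of surplus on the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counterpart:
    """Hypothetical seller whose latent type is a reservation price."""

    reservation_price: float  # latent type, hidden from the agent under test

    def accepts(self, offer: float) -> bool:
        # Assumed policy: accept any offer at or above the reservation price.
        return offer >= self.reservation_price


def buyer_surplus(buyer_value: float, offer: float, seller: Counterpart) -> float:
    """Buyer payoff: value minus price if the offer is accepted, else 0."""
    return buyer_value - offer if seller.accepts(offer) else 0.0


def oracle_surplus(buyer_value: float, seller: Counterpart) -> float:
    """Best payoff for a buyer who knows the seller's latent type."""
    return max(buyer_value - seller.reservation_price, 0.0)


def optimality_gap(buyer_value: float, offer: float, seller: Counterpart) -> float:
    """Surplus left on the table relative to the oracle reference."""
    return oracle_surplus(buyer_value, seller) - buyer_surplus(buyer_value, offer, seller)


# Illustrative numbers only: both offers are accepted, but the gaps differ.
seller = Counterpart(reservation_price=60.0)
buyer_value = 100.0
offers = {"agent_a": 90.0, "agent_b": 65.0}

deals = {name: seller.accepts(offer) for name, offer in offers.items()}
gaps = {name: optimality_gap(buyer_value, offer, seller) for name, offer in offers.items()}
deal_rate = sum(deals.values()) / len(deals)

print(deal_rate)  # both agents strike a deal
print(gaps)       # agent_a leaves far more surplus unclaimed than agent_b
```

Because the environment fixes the seller's type and policy, the gap is attributable to the agent's own offer rather than to an unpredictable opponent, which is the property the summary describes.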

Key quotes

Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation.

These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier.

We introduce Terms-Bench, short for Testbed for Economic Reasoning in Multi-turn Strategy, a Bayesian-game framework that makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure.

This turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps.

Empirically, frontier models saturate deal rate yet diverge in surplus extraction, cue use, belief calibration, and compliance, revealing agent-specific bargaining bottlenecks masked by prior benchmarks.

Snippet from the RSS feed

Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction under hidden preferences, strate...
