All Topics
All Topics
Technology
Technology
Design
Design
Programming
Programming
Science
Science
News
News
Gaming
Gaming
Entertainment
Entertainment
Business
Business
Finance
Finance
Sports
Sports
Health
Health
Food
Food
Travel
Travel
Art
Art
Music
Music
Books
Books
Education
Education
Politics
Politics
Personal
Personal
Bluesky
Twitter
No algorithm. No AI slop. No ads. Just RSS. Pro-human. Indie writers. Real journalism. Open web. Chronological. Hand toasted.

TERMS-Bench: A New Diagnostic Framework for Evaluating LLM Negotiation Agents Beyond Deal Rate

By

[Submitted on 13 May 2026 (v1), last revised 13 Jun 2026 (this version, v2)]

5h ago· 2 min readenInsight

Summary

This article introduces TERMS-Bench (Testbed for Economic Reasoning in Multi-turn Strategy), a new evaluation framework for diagnosing LLM negotiation agents. Unlike existing benchmarks that rely on LLM-vs.-LLM interaction or aggregate metrics like deal rate, TERMS-Bench uses a Bayesian-game framework where the environment itself serves as the verifier by specifying the counterpart's latent type, policy, and payoff structure. The framework turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps. Evaluating 13 LLM agents from major providers, the benchmark reveals that while frontier models saturate deal rate, they diverge significantly in surplus extraction, cue use, belief calibration, and compliance—revealing agent-specific bargaining bottlenecks that prior benchmarks masked.

Key quotes

· 5 pulled
Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation.
These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier.
We introduce Terms-Bench, short for Testbed for Economic Reasoning in Multi-turn Strategy, a Bayesian-game framework that makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure.
This turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps.
Empirically, frontier models saturate deal rate yet diverge in surplus extraction, cue use, belief calibration, and compliance, revealing agent-specific bargaining bottlenecks masked by prior benchmarks.
Snippet from the RSS feed
Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction under hidden preferences, strate

You might also wanna read