Decrypto: A new interactive benchmark for evaluating theory of mind in LLMs
Summary
This article introduces Decrypto, a new interactive language-based benchmark designed to evaluate theory of mind (ToM) and multi-agent reasoning capabilities in large language models (LLMs). It argues that existing benchmarks for ToM in LLMs suffer from narrow scope, confounding factors, and lack of interactivity. Decrypto aims to address these shortcomings by drawing inspiration from cognitive science to create a more robust evaluation framework for assessing how well LLMs can reason about the mental states of other agents in complex multi-agent scenarios.
Key quotes
Agentic LLMs are increasingly deployed in complex multi-agent scenarios, interacting, cooperating or competing with human users and other agents alike.
This requires multi-agent reasoning skills, and especially theory of mind (ToM) -- the ability to reason about the 'mental' states of other agents.
Despite that, ToM in LLMs is poorly understood, with existing benchmarks suffering from narrow scope, confounding factors and lack of interactivity.
We thus introduce Decrypto, an interactive language-based benchmark for multi-agent reasoning and ToM.
Drawing inspiration from cognitive science...