Decrypto: A new interactive benchmark for evaluating theory of mind in LLMs

8h ago · 2 min read

technology science ai research benchmarking

Summary

This article introduces Decrypto, a new interactive language-based benchmark designed to evaluate theory of mind (ToM) and multi-agent reasoning capabilities in large language models (LLMs). It argues that existing benchmarks for ToM in LLMs suffer from narrow scope, confounding factors, and lack of interactivity. Decrypto aims to address these shortcomings by drawing inspiration from cognitive science to create a more robust evaluation framework for assessing how well LLMs can reason about the mental states of other agents in complex multi-agent scenarios.

Source

Twitter / X · Decrypto: A new interactive benchmark for evaluating theory of mind in LLMs · sites.google.com

Key quotes

Agentic LLMs are increasingly deployed in complex multi-agent scenarios, interacting, cooperating or competing with human users and other agents alike.

This requires multi-agent reasoning skills, and especially theory of mind (ToM) -- the ability to reason about the 'mental' states of other agents.

Despite that, ToM in LLMs is poorly understood, with existing benchmarks suffering from narrow scope, confounding factors and lack of interactivity.

We thus introduce Decrypto, an interactive language-based benchmark for multi-agent reasoning and ToM.

Drawing inspiration from cognitive science...
