M3-BENCH: Process-Aware Evaluation of LLM Agents' Social Behaviors in Mixed-Motive Games
Abstract
Existing benchmarks for LLM agents' social behavior typically focus on a single capability dimension and evaluate only behavioral outcomes, overlooking process signals from reasoning and communication. We present M3-BENCH, a benchmark of 24 mixed-motive games with a process-aware evaluation framework spanning three complementary views: Behavioral Trajectory Analysis (BTA), Reasoning Process Analysis (RPA), and Communication Content Analysis (CCA). Evaluating 11 frontier LLMs and a human baseline, M3-BENCH reveals substantial differences in social competence that outcome-only evaluation misses. In particular, we identify an "overthink-undercommunicate" pattern: reasoning models achieve strong internal deliberation scores but often fail to translate them into effective social communication. Although top models can surpass humans on task outcomes, humans exhibit markedly higher cross-view consistency, suggesting that current LLM agents still lack the behavioral coherence characteristic of human social competence. Our analysis further shows that the three-view decomposition surfaces safety-relevant risks, such as cooperative behavior paired with latent opportunistic reasoning, that remain hidden under outcome-only metrics.