SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models

arXiv · May 29, 2025 · Significant research

Summary

MBZUAI researchers introduce SocialMaze, a new benchmark for evaluating social reasoning capabilities in large language models (LLMs). SocialMaze includes six diverse tasks across social reasoning games, daily-life interactions, and digital community platforms, emphasizing deep reasoning, dynamic interaction, and information uncertainty. Experiments show that LLMs vary in how well they handle dynamic interactions and degrade under information uncertainty, but that fine-tuning on curated reasoning examples improves their performance.

Keywords

social reasoning · large language models · benchmark · SocialMaze · MBZUAI

Related

MIRAGE: Exploring How Large Language Models Perform in Complex Social Interactive Environments

arXiv · Jan 3

The paper introduces MIRAGE, a framework for evaluating LLMs' ability to simulate human behaviors in murder mystery games. MIRAGE uses four methods (TII, CIC, ICI, and SCI) to assess the LLMs' role-playing proficiency. Experiments show that even GPT-4 struggles with the complexities of the MIRAGE framework.

LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs

arXiv · May 17

MBZUAI researchers introduce LLM-BabyBench, a benchmark suite for evaluating grounded planning and reasoning in LLMs. The suite, built on a textual adaptation of the BabyAI grid world, assesses LLMs on predicting action consequences, generating action sequences, and decomposing instructions. The datasets, evaluation harness, and metrics are publicly available to support reproducible assessment.