MBZUAI researchers have developed a new benchmark for evaluating the teaching abilities of large language models (LLMs), earning the SAC Award for Resources and Evaluation at NAACL 2025. The framework aims to measure how effectively LLMs can serve as personalized tutors, addressing Bloom's "two sigma problem" in education: the finding that students tutored one-to-one outperform classroom-taught peers by about two standard deviations. Unlike rule-based tutoring systems, LLMs are fluent but are not grounded in pedagogical principles. Why it matters: The benchmark is a step toward integrating learning science into AI, which could enable personalized AI tutors that improve educational outcomes.
MBZUAI researchers introduce SocialMaze, a new benchmark for evaluating social reasoning capabilities in large language models (LLMs). SocialMaze includes six diverse tasks across social reasoning games, daily-life interactions, and digital community platforms, emphasizing deep reasoning, dynamic interaction, and information uncertainty. Experiments show that LLMs vary in their ability to handle dynamic interactions and degrade under information uncertainty, but that fine-tuning on curated reasoning examples improves their performance.
MBZUAI researchers introduce LLM-BabyBench, a benchmark suite for evaluating grounded planning and reasoning in LLMs. The suite, built on a textual adaptation of the BabyAI grid world, assesses LLMs on three tasks: predicting the consequences of actions, generating action sequences, and decomposing instructions. The datasets, evaluation harness, and metrics are publicly available to facilitate reproducible assessment.
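To make the "predicting action consequences" task concrete, the following is a minimal sketch of how a ground-truth final state could be computed for a toy text grid world and compared against a model's answer. It is not the LLM-BabyBench harness: the `simulate` function, the action names, the direction encoding, and the wall representation are all assumptions made for illustration.

```python
# Toy grid-world simulator (hypothetical; not the LLM-BabyBench implementation).
# Directions are indexed clockwise starting from east; y grows downward.
DIRS = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # east, south, west, north


def simulate(start, facing, actions, width, height, walls=frozenset()):
    """Return the agent's final ((x, y), facing) after applying the actions.

    A "forward" move into a wall or off the grid leaves the position unchanged.
    """
    x, y = start
    d = facing
    for act in actions:
        if act == "left":
            d = (d - 1) % 4  # rotate counter-clockwise
        elif act == "right":
            d = (d + 1) % 4  # rotate clockwise
        elif act == "forward":
            nx, ny = x + DIRS[d][0], y + DIRS[d][1]
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in walls:
                x, y = nx, ny
        else:
            raise ValueError(f"unknown action: {act}")
    return (x, y), d


def prediction_is_correct(predicted_state, start, facing, actions, width, height,
                          walls=frozenset()):
    """Compare a model's predicted final state with the simulated ground truth."""
    return predicted_state == simulate(start, facing, actions, width, height, walls)
```

For example, starting at `(0, 0)` facing east on a 3x3 grid, the sequence `["forward", "right", "forward"]` ends at `(1, 1)` facing south, so a model answer would be scored against `((1, 1), 1)`.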
This research evaluates LLMs such as ChatGPT, Llama, Aya, Jais, and ACEGPT on Arabic automated essay scoring (AES) using the AR-AES dataset. The study applies zero-shot prompting, few-shot prompting, and fine-tuning, combined with a mixed-language prompting strategy. ACEGPT performed best among the LLMs with a quadratic weighted kappa (QWK) of 0.67, while a smaller BERT model achieved 0.88. Why it matters: The study highlights challenges faced by LLMs in processing Arabic and provides insights into improving LLM performance in Arabic NLP tasks.
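For readers unfamiliar with QWK, the metric measures agreement between two sets of ordinal scores (here, model and human essay scores), penalizing each disagreement by the squared distance between the two ratings. A value of 1 means perfect agreement and 0 means chance-level agreement. The sketch below is a plain-Python illustration of the standard formula; the function name and the integer rating scale are assumptions, and it is not the evaluation code used in the study.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic weighted kappa between two equal-length lists of integer ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = max_rating - min_rating + 1  # number of rating categories
    if n < 2:
        return 1.0  # only one possible rating: agreement is trivial

    # Observed confusion matrix: observed[i][j] counts items rated i by A and j by B.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:
        return 1.0  # both raters used a single identical rating throughout
    return 1.0 - numerator / denominator
```

Calling `quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 0, 3)` returns `1.0`, while two raters who disagree exactly as often as chance predicts score `0.0`.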
The paper introduces a benchmark of 1,000 multiple-choice questions to evaluate LLMs on Islamic inheritance law ('ilm al-mawarith). Seven LLMs were tested, with o3 and Gemini 2.5 achieving over 90% accuracy, while ALLaM, Fanar, LLaMA, and Mistral scored below 50%. Error analysis revealed limitations in handling structured legal reasoning. Why it matters: This research highlights the challenges and opportunities for adapting LLMs to complex, culturally-specific legal domains like Islamic jurisprudence.