GCC AI Research

Tutors of tomorrow? A new benchmark for evaluating LLMs

MBZUAI · Significant research

Summary

MBZUAI researchers have developed a new benchmark for evaluating the teaching abilities of large language models (LLMs), earning the SAC Award for Resources and Evaluation at NAACL 2025. The framework aims to measure how effectively LLMs can serve as personalized tutors, addressing education's "two sigma problem": Bloom's 1984 finding that students tutored one-on-one outperform conventionally taught students by roughly two standard deviations. Unlike rule-based intelligent tutoring systems, LLMs offer conversational fluency but are not grounded in pedagogical principles. Why it matters: This benchmark is a step toward integrating learning science into AI, potentially enabling personalized AI tutors that meaningfully improve educational outcomes.

Keywords

LLM · MBZUAI · NAACL · Benchmark · Tutoring

Related

SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models

arXiv

MBZUAI researchers introduce SocialMaze, a new benchmark for evaluating social reasoning capabilities in large language models (LLMs). SocialMaze comprises six diverse tasks spanning social reasoning games, daily-life interactions, and digital community platforms, emphasizing deep reasoning, dynamic interaction, and information uncertainty. Experiments show that LLMs vary widely in how they handle dynamic interactions, degrade under information uncertainty, and improve when fine-tuned on curated reasoning examples.

LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs

arXiv

MBZUAI researchers introduce LLM-BabyBench, a benchmark suite for evaluating grounded planning and reasoning in LLMs. The suite, built on a textual adaptation of the BabyAI grid world, assesses LLMs on predicting action consequences, generating action sequences, and decomposing instructions. Datasets, evaluation harness, and metrics are publicly available to facilitate reproducible assessment.
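
To make the "grounded" part concrete, the sketch below shows one way a harness like this can score a model-generated action sequence: by simulating it in a toy textual grid world rather than string-matching against a reference solution. It is an illustrative simplification, not LLM-BabyBench's actual code; the grid layout, action names, and check_plan function are all hypothetical.

# Minimal sketch of grounded plan verification in a toy textual grid world.
# NOT LLM-BabyBench's actual harness: the grid, action set, and success
# criterion below are hypothetical simplifications for illustration only.

GRID = [
    "#####",
    "#A..#",   # A = agent start, G = goal, # = wall
    "#.#.#",
    "#..G#",
    "#####",
]

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def find(symbol: str) -> tuple[int, int]:
    """Locate a symbol's (row, col) position in the grid."""
    for r, row in enumerate(GRID):
        for c, cell in enumerate(row):
            if cell == symbol:
                return r, c
    raise ValueError(f"symbol {symbol!r} not found")


def check_plan(actions: list[str]) -> bool:
    """Simulate an action sequence; succeed iff it legally reaches the goal."""
    pos, goal = find("A"), find("G")
    for action in actions:
        if action not in MOVES:      # unparseable action fails the episode
            return False
        dr, dc = MOVES[action]
        nr, nc = pos[0] + dr, pos[1] + dc
        if GRID[nr][nc] == "#":      # bumping into a wall fails the episode
            return False
        pos = (nr, nc)
    return pos == goal


# A (hypothetical) plan, as parsed from an LLM's text output.
plan = ["down", "down", "right", "right"]
print("plan succeeds:", check_plan(plan))  # -> plan succeeds: True

The key design point is that plans are judged by execution in the environment, so any action sequence that actually reaches the goal counts as correct, not just the one a reference solution happens to use.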

How well can LLMs Grade Essays in Arabic?

arXiv

This research evaluates LLMs including ChatGPT, Llama, Aya, Jais, and ACEGPT on Arabic automated essay scoring (AES) using the AR-AES dataset. The study compares zero-shot, few-shot, and fine-tuning approaches, combined with a mixed-language prompting strategy. ACEGPT performed best among the LLMs with a Quadratic Weighted Kappa (QWK) of 0.67, while a smaller BERT-based model achieved 0.88. Why it matters: The study highlights the challenges LLMs face in processing Arabic and offers insights for improving LLM performance on Arabic NLP tasks.
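
QWK, the agreement metric quoted above, penalizes large disagreements between predicted and human-assigned scores more heavily than near misses. Below is a minimal sketch of computing it with scikit-learn's cohen_kappa_score; the score arrays are invented for illustration and are not data from the study.

# Minimal sketch of the QWK (Quadratic Weighted Kappa) metric used above.
# The score arrays are made-up examples, not data from the AR-AES study.
from sklearn.metrics import cohen_kappa_score

human_scores = [4, 2, 5, 3, 3, 1, 4, 5]   # gold essay grades (hypothetical)
model_scores = [4, 3, 5, 3, 2, 1, 4, 4]   # predicted grades (hypothetical)

# Quadratic weights penalize a 2-point disagreement four times as much as
# a 1-point disagreement, so near-misses cost the score relatively little.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.2f}")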

Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation

arXiv

The paper introduces a benchmark of 1,000 multiple-choice questions to evaluate LLMs on Islamic inheritance law ('ilm al-mawarith). Seven LLMs were tested: o3 and Gemini 2.5 achieved over 90% accuracy, while ALLaM, Fanar, LLaMA, and Mistral scored below 50%. Error analysis revealed limitations in handling structured legal reasoning. Why it matters: This research highlights the challenges and opportunities in adapting LLMs to complex, culturally specific legal domains such as Islamic jurisprudence.
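
For context, accuracy on a multiple-choice benchmark like this is typically computed by extracting the model's chosen option label from its free-text reply and comparing it against the gold answer. The sketch below illustrates that pipeline; the item schema, the sample question, and the extract_choice parsing rule are hypothetical, since the paper's exact data format and answer-extraction method are not described here.

# Minimal sketch of scoring a multiple-choice benchmark like the one above.
# The item schema and extract_choice parsing rule are hypothetical; the
# paper's actual data format and extraction method may differ.
import re

# Each item: question text, labelled choices, and the gold answer label.
items = [
    {
        "question": "A man dies leaving a wife and one son. "
                    "What share does the wife inherit?",
        "choices": {"A": "1/2", "B": "1/4", "C": "1/8", "D": "1/6"},
        "answer": "C",
    },
    # ... the real benchmark has 1,000 such items
]


def extract_choice(model_output: str) -> str | None:
    """Pull the first standalone choice label (A-D) out of the model's reply."""
    match = re.search(r"\b([A-D])\b", model_output)
    return match.group(1) if match else None


def accuracy(model_outputs: list[str]) -> float:
    """Fraction of items where the extracted label matches the gold answer."""
    correct = sum(
        extract_choice(out) == item["answer"]
        for item, out in zip(items, model_outputs)
    )
    return correct / len(items)


print(f"accuracy = {accuracy(['The correct answer is C.']):.0%}")  # -> 100%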