ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark

arXiv · May 22, 2025 · Significant research

Summary

MBZUAI researchers introduce ARB, the first comprehensive benchmark for evaluating step-by-step multimodal reasoning in Arabic across textual and visual modalities. The benchmark spans 11 diverse domains and includes 1,356 multimodal samples with 5,119 human-curated reasoning steps. Evaluations of 12 state-of-the-art large multimodal models (LMMs) reveal persistent challenges in coherence, faithfulness, and cultural grounding, underscoring the need for culturally aware AI systems.

Keywords

multimodal reasoning · benchmark · Arabic · LMM · cultural grounding

Related

ALARB: An Arabic Legal Argument Reasoning Benchmark

arXiv · Oct 1

Researchers introduce ALARB, a new benchmark for evaluating the reasoning of Arabic LLMs, built from 13K Saudi commercial court cases. The benchmark includes tasks such as verdict prediction, reasoning chain completion, and identification of relevant regulations. A 12B-parameter model instruction-tuned on ALARB achieves performance comparable to GPT-4o in verdict prediction and generation.

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

arXiv · May 30

MBZUAI introduces Agent-X, a benchmark for evaluating multi-step reasoning in vision-centric agents across real-world, multimodal settings. Agent-X includes 828 tasks with diverse visual contexts and spans six environments, requiring tool use and stepwise decision-making. Experiments show that current LMMs struggle with multi-step vision tasks, achieving less than 50% success, which points to room for improvement in LMM reasoning and tool use.
