MBZUAI researchers presented EXAMS-V, a new benchmark dataset for evaluating the reasoning and multimodal processing abilities of vision-language models (VLMs). EXAMS-V contains over 20,000 multiple-choice questions spanning 26 subjects and 11 languages, including Arabic. Each question is embedded in an image, so answering it requires the VLM to integrate visual and textual information rather than rely on text alone. Why it matters: The dataset fills a gap in VLM evaluation, providing a resource for assessing and improving multimodal reasoning across a diverse set of languages, Arabic among them.
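To make the evaluation setup concrete, here is a minimal sketch of scoring a model on image-embedded multiple-choice questions, broken down by language. This is not the official EXAMS-V harness: the `MCQItem` record, the `predict_choice` callable, and the file names are hypothetical stand-ins, and a real run would pass the rendered question image to a VLM instead of the dummy model shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MCQItem:
    image_path: str  # screenshot containing the question text, options, and any figures
    answer: str      # gold option label, e.g. "B"
    language: str    # e.g. "ar" for Arabic
    subject: str     # e.g. "physics"


def accuracy_by_language(items: List[MCQItem],
                         predict_choice: Callable[[str], str]) -> Dict[str, float]:
    """Score a model that maps a question image to an option label, grouped by language."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for item in items:
        pred = predict_choice(item.image_path)
        total[item.language] = total.get(item.language, 0) + 1
        if pred.strip().upper() == item.answer.strip().upper():
            correct[item.language] = correct.get(item.language, 0) + 1
    return {lang: correct.get(lang, 0) / n for lang, n in total.items()}


if __name__ == "__main__":
    # Toy items and a stub "model" that always answers "A", for illustration only.
    items = [
        MCQItem("q1.png", "A", "ar", "chemistry"),
        MCQItem("q2.png", "C", "en", "biology"),
    ]
    print(accuracy_by_language(items, lambda _image: "A"))
```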
A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark comprises 8,000 manually verified samples spanning 15 categories and a range of video durations. A multilingual video LLM, ViMUL, is also presented along with a training set of 1.2 million samples; both the model and the training set are to be publicly released.
A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use over long videos, featuring open-ended questions and diagnostic rubrics. Built from human-validated samples, it addresses a limitation of existing datasets by combining long temporal extent with multimodal richness. LongShOTAgent, an agentic system for analyzing long videos, is also presented; together, the benchmark and agent highlight how challenging long-video understanding remains for state-of-the-art MLLMs.