GCC AI Research

Testing the limits of vision language models: A new benchmark dataset presented at ACL

MBZUAI · Significant research

Summary

MBZUAI researchers presented EXAMS-V, a new benchmark dataset for evaluating the multimodal reasoning abilities of vision language models (VLMs). EXAMS-V contains over 20,000 multiple-choice exam questions spanning 26 subjects and 11 languages, including Arabic. Each question is presented as an image, so a model must read and integrate the visual and textual information it contains rather than rely on clean text input.

Why it matters: The dataset fills a gap in VLM evaluation, providing a valuable resource for assessing and improving the multimodal reasoning capabilities of these models, particularly in diverse languages such as Arabic.
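To make the evaluation setup concrete, here is a minimal sketch of how a benchmark like EXAMS-V could be scored. The `Item` fields and the `ask_vlm()` helper are hypothetical stand-ins for a real dataset loader and model API, not the authors' actual harness:

```python
# Minimal sketch of an EXAMS-V-style evaluation loop.
# `Item` and `ask_vlm()` are hypothetical placeholders, not the paper's code.
from dataclasses import dataclass

@dataclass
class Item:
    image_path: str  # the full question (text, tables, figures) rendered as one image
    answer: str      # gold choice label, e.g. "B"

def ask_vlm(image_path: str, prompt: str) -> str:
    """Placeholder: send the question image plus an instruction to a
    vision language model and return its raw text reply."""
    raise NotImplementedError

def evaluate(items: list[Item]) -> float:
    """Return accuracy over a (non-empty) list of image-based MCQ items."""
    prompt = "Answer the multiple-choice question in the image with a single letter."
    correct = 0
    for item in items:
        reply = ask_vlm(item.image_path, prompt).strip().upper()
        # Take the first choice letter appearing in the reply as the model's answer.
        choice = next((c for c in reply if c in "ABCDE"), "")
        correct += choice == item.answer
    return correct / len(items)
```

The key point the sketch illustrates is that the model never receives the question as text: everything, including the answer options, arrives through the image, which is what makes the benchmark a joint test of visual perception and language reasoning.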


Related

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

arXiv

A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark comprises 8,000 manually verified samples spanning 15 categories and a range of video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples; both are to be publicly released.

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv

A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. It addresses the limitations of existing datasets by combining temporal length with multimodal richness, using human-validated samples. An agentic system, LongShOTAgent, is also presented for analyzing long videos; together, the benchmark and agent demonstrate the challenges that long-video reasoning poses for state-of-the-art multimodal LLMs (MLLMs).