MBZUAI researchers presented EXAMS-V, a new benchmark dataset for evaluating the reasoning and processing abilities of vision-language models (VLMs). EXAMS-V contains over 20,000 multiple-choice questions across 26 subjects and 11 languages, including Arabic. Each question is presented within an image, testing a VLM's ability to integrate visual and textual information. Why it matters: This dataset fills a gap in VLM evaluation, providing a resource for assessing and improving the multimodal reasoning capabilities of these models across multiple languages, including Arabic.
ViMUL-Bench is a new benchmark for evaluating video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. It includes 8k manually verified samples across 15 categories and covers a range of video durations. The authors also present ViMUL, a multilingual video LLM, along with a training set of 1.2 million samples; both are to be publicly released.
MBZUAI researchers, in collaboration with over 70 researchers, have created the Culturally diverse Visual Question Answering (CVQA) benchmark to evaluate cultural understanding in multimodal LLMs. The CVQA dataset includes over 10,000 questions in 31 languages and 13 scripts, testing models on images of local dishes, personalities, and monuments. Tests of several multimodal LLMs on CVQA showed that the benchmark is challenging even for top models. Why it matters: This benchmark highlights the need for AI models to better understand diverse cultures, promoting fairness and relevance across different languages and regions.
Researchers from MBZUAI, IBM, and ServiceNow introduced GEOBench-VLM, a benchmark for evaluating vision-language models on Earth observation tasks using satellite and aerial imagery. The benchmark includes over 10,000 human-verified instructions across 31 sub-tasks spanning object classification, localization, change detection, and more. GEOBench-VLM addresses the gap in current VLMs' ability to perform spatially grounded reasoning and change detection in satellite imagery. Why it matters: This benchmark will drive progress in AI's ability to analyze satellite data for critical applications like disaster response, climate monitoring, and urban planning in the Middle East and globally.
MBZUAI researchers have developed a new approach to improve how well vision-language models generalize to out-of-distribution data. The study, led by Sheng Zhang and involving multiple MBZUAI professors and researchers, addresses the need for AI applications to cope with unforeseen circumstances. The method aims to improve how these models, which combine natural language processing and computer vision, handle inputs unlike those seen during training. Why it matters: Improving the adaptability of vision-language models is critical for real-world AI applications like autonomous driving and medical imaging, especially in diverse and changing environments.