Middle East AI

This Week arXiv

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

arXiv · · Significant research

Summary

A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark includes 8k manually verified samples across 15 categories and varying video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples, with both to be publicly released.

Keywords

multilingual · video LLM · benchmark · ViMUL-Bench · cultural diversity

Get the weekly digest

Top AI stories from the GCC region, every week.

Related

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv ·

A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length and multimodal richness, using human-validated samples. LongShOTAgent, an agentic system, is also presented for analyzing long videos, with both the benchmark and agent demonstrating the challenges faced by state-of-the-art MLLMs.

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

arXiv ·

MBZUAI researchers introduce VideoGPT+, a novel video Large Multimodal Model (LMM) that integrates image and video encoders to leverage both spatial and temporal information in videos. They also introduce VCGBench-Diverse, a comprehensive benchmark for evaluating video LMMs across 18 video categories. VideoGPT+ demonstrates improved performance on multiple video benchmarks, including VCGBench and MVBench.

DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding

arXiv ·

MBZUAI researchers introduce DuwatBench, a new benchmark for multimodal understanding of Arabic calligraphy. The dataset contains 1,272 samples across six calligraphic styles with detailed annotations to evaluate visual-text alignment. Evaluation of 13 multimodal models reveals challenges in processing calligraphic variations and artistic distortions, highlighting the need for culturally grounded AI research.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv ·

Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.