Jun 2 – Jun 8, 2025

Top Stories

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

arXiv · Jun 8 · NLP LLM

ViMUL-Bench is a new benchmark for evaluating video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. It contains 8k manually verified samples spanning 15 categories and a range of video durations. The authors also present ViMUL, a multilingual video LLM, along with a training set of 1.2 million samples, both of which are to be publicly released.

TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation

arXiv · Jun 6 · Research CV

MBZUAI researchers introduce TerraFM, a scalable self-supervised learning model for Earth observation that uses Sentinel-1 and Sentinel-2 imagery. The model unifies radar and optical inputs through modality-specific patch embeddings and adaptive cross-attention fusion. TerraFM achieves strong generalization on classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

arXiv · Jun 5 · Research CV

MBZUAI researchers introduce VideoMathQA, a new benchmark for evaluating mathematical reasoning in videos, requiring models to interpret visual information, text, and spoken cues. The dataset spans 10 mathematical domains with videos ranging from 10 seconds to over 1 hour, and includes multi-step reasoning annotations. The benchmark aims to evaluate temporal cross-modal reasoning and highlights the limitations of existing approaches in complex video-based mathematical problem solving.

VideoMolmo: Spatio-Temporal Grounding Meets Pointing

arXiv · Jun 5 · CV LLM

Researchers from MBZUAI have introduced VideoMolmo, a large multimodal model for spatio-temporal pointing conditioned on textual descriptions. The model incorporates a temporal module with an attention mechanism and a temporal mask fusion pipeline using SAM2 for improved coherence across video sequences. They also curated a dataset of 72k video-caption pairs and introduced VPoS-Bench, a benchmark for evaluating generalization across real-world scenarios, with code and models publicly available.

From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation

arXiv · Jun 2 · NLP LLM

This paper introduces an evaluation framework for Arabic language models that addresses gaps in linguistic accuracy and cultural alignment. The authors analyze existing datasets and present the Arabic Depth Mini Dataset (ADMD), a curated collection of 490 questions across ten domains. Evaluating GPT-4, Claude 3.5 Sonnet, Gemini Flash 1.5, CommandR 100B, and Qwen-Max on ADMD shows that performance varies across models, with Claude 3.5 Sonnet achieving the highest accuracy at 30%.

Why it matters: The work emphasizes the importance of cultural competence in Arabic language model evaluation and offers practical insights for improvement.

More This Week

NextEra bootcamp ignites Saudi deep tech ecosystem: 16 startups poised to reshape industries

The Landscape of Arabic Large Language Models (ALLMs): A New Era for Arabic Language Technology

KFAS Conducts TechEdge Program with NBK and Zain - intlbm

MENA startup funding grows in May as Egypt rebounds - Arab News