Middle East AI

This Week arXiv

Video-CoM: Interactive Video Reasoning via Chain of Manipulations

arXiv · · Significant research

Summary

Researchers at MBZUAI introduce "Interactive Video Reasoning," a new paradigm enabling models to actively "think with videos" by performing iterative visual actions to gather and refine evidence. They developed Video CoM, which reasons through a Chain of Manipulations (CoM), and constructed Video CoM Instruct, an 18K instruction tuning dataset for multi-step manipulation reasoning. The model is further optimized via reinforcement learning with reasoning aware Group Relative Policy Optimization (GRPO), achieving strong results across nine video reasoning benchmarks.

Keywords

Video Reasoning · Chain of Manipulations · MLLMs · Reinforcement Learning · MBZUAI

Get the weekly digest

Top AI stories from the GCC region, every week.

Related

CoVR-R:Reason-Aware Composed Video Retrieval

arXiv ·

A new approach to composed video retrieval (CoVR) is presented, which leverages large multimodal models to infer causal and temporal consequences implied by an edit. The method aligns reasoned queries to candidate videos without task-specific finetuning. A new benchmark, CoVR-Reason, is introduced to evaluate reasoning in CoVR.

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv ·

A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length and multimodal richness, using human-validated samples. LongShOTAgent, an agentic system, is also presented for analyzing long videos, with both the benchmark and agent demonstrating the challenges faced by state-of-the-art MLLMs.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv ·

Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

arXiv ·

MBZUAI researchers introduce VideoMathQA, a new benchmark for evaluating mathematical reasoning in videos, requiring models to interpret visual information, text, and spoken cues. The dataset spans 10 mathematical domains with videos ranging from 10 seconds to over 1 hour, and includes multi-step reasoning annotations. The benchmark aims to evaluate temporal cross-modal reasoning and highlights the limitations of existing approaches in complex video-based mathematical problem solving.