Middle East AI

This Week arXiv

Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models

arXiv · · Significant research

Summary

Researchers at MBZUAI have introduced Video-R2, a reinforcement learning approach to improve the consistency and visual grounding of reasoning in multimodal language models. Video-R2 combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a Temporal Alignment Reward (TAR). The model demonstrates higher Think Answer Consistency (TAC), Video Attention Score (VAS), and accuracy across multiple benchmarks, showing improved temporal alignment and reasoning coherence for video understanding.

Keywords

multimodal language models · video reasoning · reinforcement learning · temporal alignment · MBZUAI

Get the weekly digest

Top AI stories from the GCC region, every week.

Related

CoVR-R:Reason-Aware Composed Video Retrieval

arXiv ·

A new approach to composed video retrieval (CoVR) is presented, which leverages large multimodal models to infer causal and temporal consequences implied by an edit. The method aligns reasoned queries to candidate videos without task-specific finetuning. A new benchmark, CoVR-Reason, is introduced to evaluate reasoning in CoVR.

Video-CoM: Interactive Video Reasoning via Chain of Manipulations

arXiv ·

Researchers at MBZUAI introduce "Interactive Video Reasoning," a new paradigm enabling models to actively "think with videos" by performing iterative visual actions to gather and refine evidence. They developed Video CoM, which reasons through a Chain of Manipulations (CoM), and constructed Video CoM Instruct, an 18K instruction tuning dataset for multi-step manipulation reasoning. The model is further optimized via reinforcement learning with reasoning aware Group Relative Policy Optimization (GRPO), achieving strong results across nine video reasoning benchmarks.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv ·

Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.

MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering

arXiv ·

This paper introduces MOTOR, a multimodal retrieval and re-ranking approach for medical visual question answering (MedVQA) that uses grounded captions and optimal transport to capture relationships between queries and retrieved context, leveraging both textual and visual information. MOTOR identifies clinically relevant contexts to augment VLM input, achieving higher accuracy on MedVQA datasets. Empirical analysis shows MOTOR outperforms state-of-the-art methods by an average of 6.45%.