VideoMolmo: Spatio-Temporal Grounding Meets Pointing

arXiv · June 5, 2025 · Significant research

Summary

Researchers from MBZUAI have introduced VideoMolmo, a large multimodal model for spatio-temporal pointing conditioned on textual descriptions. The model incorporates a temporal module built on an attention mechanism, along with a temporal mask fusion pipeline that uses SAM2 to improve coherence across video sequences. The team also curated a dataset of 72k video-caption pairs and introduced VPoS-Bench, a benchmark for evaluating generalization across real-world scenarios. Code and models are publicly available.

Keywords

spatio-temporal pointing · multimodal model · video segmentation · MBZUAI · VPoS-Bench


Related

Video-CoM: Interactive Video Reasoning via Chain of Manipulations

arXiv · Nov 28

Researchers at MBZUAI introduce "Interactive Video Reasoning," a paradigm in which models actively "think with videos" by performing iterative visual actions to gather and refine evidence. They developed Video-CoM, which reasons through a Chain of Manipulations (CoM), and constructed Video-CoM-Instruct, an 18K instruction-tuning dataset for multi-step manipulation reasoning. The model is further optimized via reinforcement learning with reasoning-aware Group Relative Policy Optimization (GRPO), achieving strong results across nine video reasoning benchmarks.
