Video-CoM: Interactive Video Reasoning via Chain of Manipulations
arXiv ·
Researchers at MBZUAI introduce "Interactive Video Reasoning," a new paradigm enabling models to actively "think with videos" by performing iterative visual actions to gather and refine evidence. They developed Video CoM, which reasons through a Chain of Manipulations (CoM), and constructed Video CoM Instruct, an 18K instruction tuning dataset for multi-step manipulation reasoning. The model is further optimized via reinforcement learning with reasoning aware Group Relative Policy Optimization (GRPO), achieving strong results across nine video reasoning benchmarks.