Researchers at MBZUAI introduce "Interactive Video Reasoning," a new paradigm that enables models to actively "think with videos" by performing iterative visual actions to gather and refine evidence. They develop Video CoM, which reasons through a Chain of Manipulations (CoM), and construct Video CoM Instruct, an 18K-sample instruction-tuning dataset for multi-step manipulation reasoning. The model is further optimized via reinforcement learning with reasoning-aware Group Relative Policy Optimization (GRPO), achieving strong results across nine video reasoning benchmarks.
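As a rough illustration of the optimization step, GRPO scores a group of sampled rollouts for the same prompt and normalizes each reward against the group's statistics. The sketch below shows only this group-relative advantage computation with made-up reward values; the paper's actual reasoning-aware reward terms are not reproduced here.

```python
# Hedged sketch: group-relative advantage normalization as used in GRPO.
# Rewards and group size are illustrative, not from the paper.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled rollouts for one prompt, scored by some reward function.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)  # sums to ~0 by construction
```

Because advantages are centered within each group, no separate value network is needed to estimate a baseline, which is the main practical appeal of GRPO over PPO-style training.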
This paper presents the synthesis of a 1-DoF six-bar gripper mechanism for aerial grasping, designed for a task in the Mohamed Bin Zayed International Robotics Challenge (MBZIRC) 2020. The synthesis process involves selecting the mechanism class, determining the number of links and joints using algebraic methods, and optimizing link dimensions via geometric programming. The gripper was modeled in CAD software, additively manufactured, and mounted on a UAV, where a DC motor actuates it to grip spherical objects. Why it matters: The research contributes to advancements in robotics and aerial manipulation, with potential applications in various industries, particularly for tasks requiring remote object retrieval and manipulation.
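The link-and-joint counting step typically rests on the planar Grübler-Kutzbach mobility criterion, DoF = 3(n − 1) − 2·j1 − j2, where n is the number of links (ground included), j1 the count of one-DoF joints, and j2 the count of two-DoF joints. The sketch below assumes a standard six-bar linkage with seven revolute joints, which yields the single degree of freedom described; the paper's exact joint arrangement is not reproduced here.

```python
# Illustrative mobility check via the planar Grübler-Kutzbach criterion.
# The six-link, seven-joint configuration is a standard six-bar assumption,
# not a detail taken from the paper.
def planar_mobility(links, one_dof_joints, two_dof_joints=0):
    """Degrees of freedom of a planar linkage (ground link included in count)."""
    return 3 * (links - 1) - 2 * one_dof_joints - two_dof_joints

dof = planar_mobility(links=6, one_dof_joints=7)  # -> 1, a 1-DoF mechanism
```

A 1-DoF result is what makes single-motor actuation possible: one DC motor input fully determines the gripper's configuration.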
A new approach to composed video retrieval (CoVR) is presented, which leverages large multimodal models to infer the causal and temporal consequences implied by an edit. The method aligns reasoned queries to candidate videos without task-specific fine-tuning. A new benchmark, CoVR-Reason, is introduced to evaluate reasoning in CoVR.
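The training-free alignment step can be pictured as embedding the reasoned text query and ranking candidate videos by similarity in a shared space. The sketch below uses toy vectors and plain cosine similarity; in practice the embeddings would come from a pretrained vision-language encoder, and the function names here are illustrative, not from the paper.

```python
# Hedged sketch: ranking candidate videos against a reasoned query by
# cosine similarity. Embeddings are toy vectors, not real model outputs.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_candidates(query_emb, video_embs):
    """Return candidate indices sorted by descending similarity to the query."""
    scores = [cosine(query_emb, v) for v in video_embs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

query = [0.9, 0.1, 0.0]                 # embedding of the reasoned query
videos = [[0.1, 0.9, 0.0],              # candidate video embeddings
          [0.8, 0.2, 0.1],
          [0.0, 0.0, 1.0]]
order = rank_candidates(query, videos)  # best match first
```

Since the ranking uses off-the-shelf embeddings directly, no task-specific fine-tuning is needed, matching the zero-shot framing of the summary above.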