Video motion magnification amplifies subtle movements in video footage, making the imperceptible visible across various fields. In healthcare, it allows non-invasive monitoring of vital signs and micro-expressions. In engineering, it helps detect structural vibrations in infrastructure, while also being used in sports science, security, and robotics. Why it matters: The technology's ability to reveal hidden details has the potential to revolutionize diagnostics, monitoring, and decision-making in diverse sectors across the Middle East.
Researchers from MBZUAI have introduced the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) for assessing Video-LLMs. The benchmark evaluates models across 11 real-world video dimensions, revealing challenges in robustness and reasoning, particularly for open-source models. A training-free Dual-Step Contextual Prompting (DSCP) technique is proposed to enhance Video-LMM performance, with the dataset and code made publicly available.
MBZUAI researchers developed Mobile-VideoGPT, a compact and efficient multimodal model for real-time video understanding on edge devices. The system uses keyframe selection, efficient token projection, and a Qwen-2.5-0.5B language model. Testing showed that Mobile-VideoGPT is faster and performs better than other models while being significantly smaller, and the model and code are publicly available. Why it matters: This research enables on-device AI processing for video, reducing reliance on remote servers and addressing privacy concerns, which can accelerate the adoption of AI in mobile and embedded applications.
MBZUAI researchers introduce VideoGPT+, a novel video Large Multimodal Model (LMM) that integrates image and video encoders to leverage both spatial and temporal information in videos. They also introduce VCGBench-Diverse, a comprehensive benchmark for evaluating video LMMs across 18 video categories. VideoGPT+ demonstrates improved performance on multiple video benchmarks, including VCGBench and MVBench.
A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark includes 8k manually verified samples across 15 categories and varying video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples, with both to be publicly released.