A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark includes 8k manually verified samples across 15 categories and varying video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples, with both to be publicly released.
Researchers introduce PALO, a polyglot large multimodal model with visual reasoning capabilities in 10 major languages including Arabic. A semi-automated translation approach was used to adapt the multimodal instruction dataset from English to the target languages. The models are trained across three scales (1.7B, 7B and 13B parameters) and a multilingual multimodal benchmark is proposed for evaluation.
Manling Li from UIUC proposes a new research direction: Event-Centric Multimodal Knowledge Acquisition, which transforms traditional entity-centric single-modal knowledge into event-centric multi-modal knowledge. The approach addresses challenges in understanding multimodal semantic structures using zero-shot cross-modal transfer (CLIP-Event) and long-horizon temporal dynamics through the Event Graph Model. Li's work aims to enable machines to capture complex timelines and relationships, with applications in timeline generation, meeting summarization, and question answering. Why it matters: This research pioneers a new approach to multimodal information extraction, moving from static entity-based understanding to dynamic, event-centric knowledge acquisition, which is essential for advanced AI applications in understanding complex scenarios.
A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length and multimodal richness, using human-validated samples. LongShOTAgent, an agentic system, is also presented for analyzing long videos, with both the benchmark and agent demonstrating the challenges faced by state-of-the-art MLLMs.
MBZUAI researchers demonstrated a low-latency, multilingual multimodal AI system at GITEX that integrates speech, text, and visual capabilities for more lifelike human-machine conversation. The demo, led by Dr. Hisham Cholakkal, includes a mobile app where users can point their camera at an object and ask questions, receiving spoken answers in multiple languages. They are also integrating the model into a robot dog that can respond to voice commands. Why it matters: This work addresses key challenges in deploying LLMs to real-world applications in the Middle East, such as multilingual support and real-time responsiveness.
Nicu Sebe from the University of Trento presented recent work on video generation, focusing on animating objects in a source image using external information like labels, driving videos, or text. He introduced a Learnable Game Engine (LGE) trained from monocular annotated videos, which maintains states of scenes, objects, and agents to render controllable viewpoints. Why it matters: This talk highlights advancements in cross-modal AI, potentially enabling new applications in gaming, simulation, and content creation within the region.