GCC AI Research

Towards embodied multi-modal visual understanding

MBZUAI · Notable

Summary

Ivan Laptev from Inria Paris presented a talk at MBZUAI on embodied multi-modal visual understanding, covering recent advances in video understanding tasks such as question answering and captioning, as well as work on vision-language navigation and manipulation. He argued that detailed visual understanding of the physical world is still in its early stages and discussed open research directions in robotics and video generation. Why it matters: The discussion of robotics applications and future research directions in embodied AI could shape AI research and development in the UAE, particularly at MBZUAI.


Related

Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization

arXiv

This paper introduces Adaptive Entropy-aware Optimization (AEO), a new framework for Multimodal Open-set Test-time Adaptation (MM-OSTTA). AEO combines Unknown-aware Adaptive Entropy Optimization (UAE) and Adaptive Modality Prediction Discrepancy Optimization (AMP) to separate unknown-class samples from known ones during online adaptation by amplifying the entropy gap between the two. The study establishes a new benchmark derived from existing datasets spanning five modalities and evaluates AEO across various domain-shift scenarios, demonstrating its effectiveness in long-term and continual MM-OSTTA settings.
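The core mechanism is easy to picture. The sketch below is a minimal illustration under assumed details, not the paper's implementation: compute each test sample's prediction entropy, then push low-entropy samples (likely known classes) toward even lower entropy and high-entropy samples (likely unknown) toward higher entropy, widening the gap between the two groups. The batch-mean threshold and unit weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prediction_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-sample Shannon entropy of the softmax predictions."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * torch.log(probs + 1e-8)).sum(dim=-1)

def adaptive_entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """Illustrative entropy-aware objective (an assumption, not AEO's exact loss).

    Samples below the batch-mean entropy are treated as likely known and their
    entropy is minimized; samples above it are treated as likely unknown and
    their entropy is maximized, amplifying the known/unknown entropy gap
    during online adaptation.
    """
    ent = prediction_entropy(logits)
    threshold = ent.mean().detach()  # assumed threshold: batch mean
    sign = torch.where(ent < threshold,
                       torch.ones_like(ent),   # +1 -> minimize entropy
                       -torch.ones_like(ent))  # -1 -> maximize entropy
    return (sign * ent).mean()

# One online-adaptation step on an unlabeled test batch
# (model and optimizer are hypothetical placeholders):
# loss = adaptive_entropy_loss(model(test_batch))
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```

In a real test-time-adaptation loop this loss would be backpropagated through a (typically partially frozen) model on each incoming unlabeled batch.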

Cross-modal understanding and generation of multimodal content

MBZUAI

Nicu Sebe from the University of Trento presented recent work on video generation, focusing on animating objects in a source image using external information such as labels, driving videos, or text. He introduced a Learnable Game Engine (LGE), trained from monocular annotated videos, which maintains the states of scenes, objects, and agents and renders them from controllable viewpoints. Why it matters: This talk highlights advances in cross-modal AI that could enable new applications in gaming, simulation, and content creation within the region.

Foundations of Multisensory Artificial Intelligence

MBZUAI

Paul Liang from CMU presented his work on machine learning foundations for multisensory AI, including a theoretical framework for modality interactions. The talk covered cross-modal attention and multimodal transformer architectures, along with applications in mental health, pathology, and robotics. Liang's research aims to enable AI systems that integrate and learn from diverse real-world sensory modalities. Why it matters: This highlights the growing importance of multimodal AI research and its potential to drive advances across sectors in the region, including healthcare and robotics.
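For readers unfamiliar with the term, cross-modal attention simply lets tokens from one modality query tokens from another. The following is a generic sketch; dimensions and layer choices are assumptions, not a specific architecture from the talk.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Generic cross-modal attention block: modality A attends to modality B."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, a_tokens: torch.Tensor, b_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from modality A (e.g. text); keys and values from modality B (e.g. vision).
        attended, _ = self.attn(a_tokens, b_tokens, b_tokens)
        return self.norm(a_tokens + attended)  # residual + norm, transformer-style

# Usage with dummy tensors: 8 text tokens attending over 50 image patches.
text = torch.randn(2, 8, 256)
image = torch.randn(2, 50, 256)
fused = CrossModalAttention()(text, image)  # -> shape (2, 8, 256)
```

Stacking such blocks in both directions (text-to-vision and vision-to-text) is the basic pattern behind many multimodal transformer architectures.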

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

arXiv

MBZUAI introduces Agent-X, a benchmark for evaluating multi-step reasoning in vision-centric agents in real-world, multimodal settings. Agent-X comprises 828 tasks with diverse visual contexts spanning six environments, all requiring tool use and stepwise decision-making. Experiments show that current LMMs struggle with these multi-step vision tasks, achieving less than 50% success, highlighting clear room for improvement in LMM reasoning and tool use.
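To make the setup concrete, here is a hedged sketch of the kind of stepwise, tool-using rollout such a benchmark implies; the task format, tool registry, and success check are hypothetical, not Agent-X's actual harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Step:
    tool: str              # tool the agent chose at this step (e.g. "detector", "ocr")
    argument: str          # input passed to the tool, or the final answer text
    observation: str = ""  # what the tool returned

@dataclass
class Task:
    question: str
    image_path: str
    expected_answer: str

def run_agent(task: Task,
              choose_step: Callable[[Task, List[Step]], Step],
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 6) -> List[Step]:
    """Roll out a multi-step, tool-using agent on one vision-centric task.

    `choose_step` stands in for the LMM policy and `tools` maps tool names to
    callables; both are placeholders for illustration only.
    """
    trace: List[Step] = []
    for _ in range(max_steps):
        step = choose_step(task, trace)
        if step.tool == "final_answer":  # the agent decides it is done
            trace.append(step)
            break
        step.observation = tools[step.tool](step.argument)
        trace.append(step)
    return trace

def task_success(trace: List[Step], task: Task) -> bool:
    """Toy end-to-end criterion: the final answer matches the reference exactly."""
    return bool(trace) and trace[-1].tool == "final_answer" and \
        trace[-1].argument.strip().lower() == task.expected_answer.strip().lower()
```

The `choose_step` policy would wrap the model under evaluation; swapping it out lets the same harness compare different LMMs on identical task traces.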