Marc Pollefeys from ETH Zurich and Microsoft Spatial AI Lab will discuss building 3D environment representations for assisting humans and robots. The talk covers visual 3D mapping, localization, spatial data access, and navigation using geometry and learning-based methods. It also explores building rich 3D semantic representations for scene interaction via open-vocabulary queries that leverage foundation models. Why it matters: Advancements in spatial AI and 3D scene understanding are critical for enabling more capable robots and AI assistants in various applications within the region.
MBZUAI researchers have introduced SURPRISE3D, a benchmark for evaluating 3D spatial reasoning in AI systems, along with a 3D Spatial Reasoning Segmentation (3D-SRS) task. The benchmark includes over 900 indoor scenes and 200,000 language queries paired with 3D masks, emphasizing spatial relationships over object naming. A companion paper, MLLM-For3D, explores adapting 2D multimodal LLMs for 3D reasoning. Why it matters: This work addresses a key limitation of current AI systems, reasoning about spatial relationships in 3D, and pushes toward embodied AI that can understand and act in 3D environments based on human-like spatial reasoning.
Krishna Murthy, a postdoc at MIT, researches computational world models to enable robots to understand and operate effectively in the physical world. His work focuses on differentiable computing approaches for spatial perception and on interfacing large image, language, and audio models with 3D scenes. Murthy envisions structured world models working with scaling-based approaches to create versatile robot perception and planning algorithms. Why it matters: This research could significantly advance robotics by enabling more sophisticated perception, reasoning, and action capabilities in embodied agents.
Ian Reid, a Professor of Computer Science at the University of Adelaide, gave a talk at MBZUAI on leveraging deep learning to go beyond geometric SLAM. The talk covered using prior domain knowledge to improve map and shape estimation and enabling navigation in unvisited environments. The research aims to turn cameras into devices for flexible, large-scale situational awareness or "Spatial AI" sensors. Why it matters: Integrating deep learning with SLAM could significantly advance robotic navigation and spatial understanding, with applications for autonomous systems in various industries.
A presentation discusses the evolution of Vision-and-Language Navigation (VLN) from benchmarks like Room-to-Room (R2R). It highlights the role of Large Language Models (LLMs) such as GPT-4 in enabling more natural human-machine interactions. The presentation showcases work using LLMs to decode navigational instructions and improve robotic navigation. Why it matters: This research demonstrates the potential of merging vision, language, and robotics for advanced AI applications in navigation and human-computer interaction.