GCC AI Research


Results for "MLLM-For3D"

Tracking Meets Large Multimodal Models for Driving Scenario Understanding

arXiv

Researchers at MBZUAI have introduced a novel approach to enhance Large Multimodal Models (LMMs) for autonomous driving by integrating 3D tracking information. This method uses a track encoder to embed spatial and temporal data, enriching visual queries and improving the LMM's understanding of driving scenarios. Experiments on DriveLM-nuScenes and DriveLM-CARLA benchmarks demonstrate significant improvements in perception, planning, and prediction tasks compared to baseline models.
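The core idea — a track encoder embedding per-object spatial and temporal state, whose outputs enrich the LMM's visual queries — can be sketched with a toy cross-attention fusion. Everything below (shapes, the linear "encoder", the residual fusion) is an illustrative assumption, not the authors' implementation.

```python
# Toy sketch: fuse 3D track embeddings into visual queries via cross-attention.
# All names and dimensions are illustrative, not taken from the paper.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encode_tracks(track_states):
    """Embed per-object (x, y, z, vx, vy, vz) track states with a toy
    linear projection standing in for the learned track encoder."""
    rng = np.random.default_rng(0)
    W = rng.normal(size=(track_states.shape[-1], 16))
    return track_states @ W  # (num_tracks, 16)

def fuse_queries_with_tracks(queries, track_emb):
    """Visual queries attend over track embeddings, then add the
    attended context back residually — 'enriching' the queries."""
    scores = queries @ track_emb.T / np.sqrt(queries.shape[-1])
    attn = softmax(scores, axis=-1)       # (num_queries, num_tracks)
    return queries + attn @ track_emb     # residual enrichment

queries = np.random.default_rng(1).normal(size=(8, 16))  # 8 visual queries
tracks = np.random.default_rng(2).normal(size=(5, 6))    # 5 tracked objects
fused = fuse_queries_with_tracks(queries, encode_tracks(tracks))
print(fused.shape)  # (8, 16)
```

The fused queries keep the original shape, so they can drop into the LMM's input pipeline unchanged — one plausible reason residual fusion is a natural design here.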

VideoMolmo: Spatio-Temporal Grounding Meets Pointing

arXiv

Researchers from MBZUAI have introduced VideoMolmo, a large multimodal model for spatio-temporal pointing conditioned on textual descriptions. The model incorporates a temporal module with an attention mechanism and a temporal mask fusion pipeline using SAM2 for improved coherence across video sequences. They also curated a dataset of 72k video-caption pairs and introduced VPoS-Bench, a benchmark for evaluating generalization across real-world scenarios, with code and models publicly available.
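The temporal mask fusion idea — smoothing per-frame segmentation masks against their predecessors so the pointed-at object stays coherent across a video — can be illustrated with a toy exponential blend. This is a hypothetical stand-in for the paper's SAM2-based pipeline; the decay weighting and threshold are invented for illustration.

```python
# Toy sketch of temporal mask fusion: blend each frame's binary mask with
# an exponentially-weighted history, then re-binarize. Illustrative only.
import numpy as np

def temporal_mask_fusion(masks, decay=0.5, threshold=0.25):
    """Fuse per-frame binary masks over time for temporal coherence."""
    fused = []
    running = np.zeros_like(masks[0], dtype=float)
    for m in masks:
        running = decay * running + (1 - decay) * m
        fused.append((running > threshold).astype(np.uint8))
    return fused

# A pixel that turns on from frame 1 onward stays on once history builds up.
masks = [np.zeros((2, 2), dtype=np.uint8) for _ in range(4)]
for m in masks[1:]:
    m[0, 0] = 1
fused = temporal_mask_fusion(masks)
print(fused[0][0, 0], fused[-1][0, 0])  # 0 1
```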

Why 3D spatial reasoning still trips up today’s AI systems

MBZUAI

MBZUAI researchers have introduced SURPRISE3D, a benchmark for evaluating 3D spatial reasoning in AI systems, along with a 3D Spatial Reasoning Segmentation (3D-SRS) task. The benchmark includes over 900 indoor scenes and 200,000 language queries paired with 3D masks, emphasizing spatial relationships over object naming. A companion paper, MLLM-For3D, explores adapting 2D multimodal LLMs for 3D reasoning. Why it matters: This work addresses a key limitation in current AI, pushing towards embodied AI that can understand and act in 3D environments based on human-like spatial reasoning.

Towards Unified and Lossless Latent Space for 3D Molecular Latent Diffusion Modeling

arXiv

The paper introduces UAE-3D, a multi-modal VAE for 3D molecule generation that compresses molecules into a unified latent space, maintaining near-zero reconstruction error. This approach simplifies latent diffusion modeling by eliminating the need to handle multi-modality and equivariance separately. Experiments on GEOM-Drugs and QM9 datasets show UAE-3D establishes new benchmarks in de novo and conditional 3D molecule generation, with significant improvements in efficiency and quality.
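The "unified, near-lossless latent space" idea can be illustrated with a linear toy: flatten the molecule's modalities (discrete atom types, continuous 3D coordinates) into one feature vector, project it into a single latent space at least as wide as the features, and decode with the pseudo-inverse — giving near-zero reconstruction error. This is a sketch of the concept only; UAE-3D's actual VAE is learned and nonlinear.

```python
# Toy sketch: a unified (multi-modal) latent space with near-zero
# reconstruction error, via a linear encode/decode pair. Illustrative only.
import numpy as np

def molecule_features(atom_types, coords):
    """Flatten two modalities into one joint feature vector."""
    return np.concatenate([atom_types.ravel(), coords.ravel()])

rng = np.random.default_rng(0)
n_atoms, n_types = 4, 3
atom_types = np.eye(n_types)[rng.integers(0, n_types, n_atoms)]  # one-hot
coords = rng.normal(size=(n_atoms, 3))                           # 3D positions

feats = molecule_features(atom_types, coords)
W = rng.normal(size=(feats.size, feats.size))  # latent as wide as features
z = feats @ W                       # unified latent code
recon = z @ np.linalg.pinv(W)       # linear decode
print(np.allclose(recon, feats))    # True: near-zero reconstruction error
```

Because both modalities live in the one latent `z`, a downstream diffusion model only needs to denoise a single vector — the simplification the paper claims over handling multi-modality and equivariance separately.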

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

arXiv

Researchers at MBZUAI have introduced EvoLMM, a self-evolving framework for large multimodal models that enhances reasoning capabilities without human-annotated data or reward distillation. EvoLMM uses two cooperative agents, a Proposer and a Solver, which generate image-grounded questions and solve them through internal consistency, using a continuous self-rewarding process. Evaluations using Qwen2.5-VL as the base model showed performance gains of up to 3% on multimodal math-reasoning benchmarks like ChartQA, MathVista, and MathVision using only raw training images.
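The Proposer/Solver loop with a continuous, annotation-free reward can be sketched as follows. Here the "models" are toy arithmetic stand-ins and the reward is the fraction of sampled answers agreeing with the majority — one simple way to realize "internal consistency" as a continuous signal. All of this is illustrative, not EvoLMM's actual implementation.

```python
# Toy sketch of a Proposer/Solver loop with a continuous self-consistency
# reward and no ground-truth labels. All components are illustrative.
import random

def proposer(rng):
    """Generate a question (here: toy arithmetic instead of an
    image-grounded question)."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return f"{a}+{b}"

def solver(question, rng, noise=0.3):
    """Answer the question, occasionally making an error — a stand-in
    for a stochastic model sample."""
    a, b = map(int, question.split("+"))
    ans = a + b
    return ans if rng.random() > noise else ans + rng.choice([-1, 1])

def consistency_reward(question, rng, k=8):
    """Continuous reward in [0, 1]: fraction of k sampled answers that
    agree with the majority answer. No human labels needed."""
    answers = [solver(question, rng) for _ in range(k)]
    majority = max(set(answers), key=answers.count)
    return answers.count(majority) / k

rng = random.Random(0)
question = proposer(rng)
reward = consistency_reward(question, rng)
print(0.0 <= reward <= 1.0)  # True
```

A continuous reward like this gives a usable gradient signal even when no single answer is verifiably correct, which is why self-consistency is a common choice for label-free self-improvement loops.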

MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis

arXiv

The paper introduces MedPromptX, a clinical decision support system for chest X-ray diagnosis that combines multimodal large language models (MLLMs), few-shot prompting (FP), and visual grounding (VG), integrating imagery with EHR data. MedPromptX dynamically refines its few-shot examples to adapt to new patient cases in real time, and uses visual grounding to narrow the search area within X-ray images. The study also introduces MedPromptX-VQA, a new visual question answering dataset, and reports state-of-the-art performance with an 11% F1-score improvement over baselines.
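The dynamic few-shot refinement step — picking the most relevant exemplars for each incoming case rather than using a fixed prompt — can be sketched with cosine-similarity retrieval over case embeddings. The embeddings, labels, and selection rule below are illustrative assumptions, not MedPromptX's actual components.

```python
# Toy sketch of dynamic few-shot selection: retrieve the k exemplars most
# similar to the current case and build the prompt from them. Illustrative.
import numpy as np

def select_few_shot(query_emb, pool_embs, pool_labels, k=3):
    """Pick the k most cosine-similar exemplars for the prompt."""
    sims = pool_embs @ query_emb / (
        np.linalg.norm(pool_embs, axis=1) * np.linalg.norm(query_emb))
    top = np.argsort(-sims)[:k]
    return [(int(i), pool_labels[i]) for i in top]

rng = np.random.default_rng(0)
pool = rng.normal(size=(10, 8))                 # 10 past cases, 8-d embeddings
labels = [f"case_{i}" for i in range(10)]
query = pool[4] + 0.01 * rng.normal(size=8)     # new patient, close to case_4
picked = select_few_shot(query, pool, labels)
print(picked[0][1])  # case_4
```

Re-running the selection per patient is what makes the prompt adapt "in real time": the exemplars change with every query instead of being baked into a static template.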