GCC AI Research

Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models

arXiv · Significant research

Summary

Researchers have introduced VISE (Visual Invariance Self-Evolution), a purely unsupervised framework that addresses "visual under-conditioning" in self-evolving Large Multimodal Models (LMMs). VISE uses geometric and semantic invariance-based rewards to regularize the model's visual conditioning directly, so that the model attends to visual content rather than relying on language priors. In experiments with Qwen3-VL-2B trained only on raw unlabeled images, VISE yields substantial gains, including +16.85 CIDEr on COCO and a 5.0-point reduction in object hallucination across 18 benchmarks.

Why it matters: This research from MBZUAI improves the visual reasoning and reliability of LMMs in unsupervised settings, making them more robust for real-world applications.

Related

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

arXiv

Researchers at MBZUAI have introduced EvoLMM, a self-evolving framework for large multimodal models that enhances reasoning capabilities without human-annotated data or reward distillation. EvoLMM uses two cooperative agents: a Proposer that generates image-grounded questions and a Solver that answers them, with internal consistency serving as a continuous self-reward. With Qwen2.5-VL as the base model and only raw training images, evaluations showed performance gains of up to 3% on multimodal math-reasoning benchmarks such as ChartQA, MathVista, and MathVision.

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv

LongShOTBench is a new benchmark for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. It addresses the limitations of existing datasets by combining temporal length with multimodal richness, using human-validated samples. The work also presents LongShOTAgent, an agentic system for analyzing long videos; both the benchmark and the agent highlight the challenges facing state-of-the-art MLLMs.