Middle East AI

This Week arXiv

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

arXiv · · Significant research

Summary

MBZUAI introduces Agent-X, a benchmark for evaluating multi-step reasoning in vision-centric agents across real-world, multimodal settings. Agent-X includes 828 tasks with diverse visual contexts and spans six environments, requiring tool use and stepwise decision-making. Experiments show that current LLMs struggle with multi-step vision tasks, achieving less than 50% success, highlighting areas for improvement in LMM reasoning and tool use.

Keywords

Agent-X · benchmark · multimodal reasoning · vision-centric agents · MBZUAI

Get the weekly digest

Top AI stories from the GCC region, every week.

Related

MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

arXiv ·

Researchers introduce MATRIX, a vision-centric agent tuning framework for robust tool-use reasoning in VLMs. The framework includes M-TRACE, a dataset of 28.5K multimodal tasks with 177K verified trajectories, and Pref-X, a set of 11K automatically generated preference pairs. Experiments show MATRIX consistently outperforms open- and closed-source VLMs across three benchmarks.

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv ·

A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length and multimodal richness, using human-validated samples. LongShOTAgent, an agentic system, is also presented for analyzing long videos, with both the benchmark and agent demonstrating the challenges faced by state-of-the-art MLLMs.

CoVR-R:Reason-Aware Composed Video Retrieval

arXiv ·

A new approach to composed video retrieval (CoVR) is presented, which leverages large multimodal models to infer causal and temporal consequences implied by an edit. The method aligns reasoned queries to candidate videos without task-specific finetuning. A new benchmark, CoVR-Reason, is introduced to evaluate reasoning in CoVR.

Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization

arXiv ·

This paper introduces Adaptive Entropy-aware Optimization (AEO), a new framework to tackle Multimodal Open-set Test-time Adaptation (MM-OSTTA). AEO uses Unknown-aware Adaptive Entropy Optimization (UAE) and Adaptive Modality Prediction Discrepancy Optimization (AMP) to distinguish unknown class samples during online adaptation by amplifying the entropy difference between known and unknown samples. The study establishes a new benchmark derived from existing datasets with five modalities and evaluates AEO's performance across various domain shift scenarios, demonstrating its effectiveness in long-term and continual MM-OSTTA settings.