Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

arXiv · May 30, 2025 · Significant research

Summary

MBZUAI introduces Agent-X, a benchmark for evaluating multi-step reasoning in vision-centric agents across real-world, multimodal settings. Agent-X includes 828 tasks with diverse visual contexts and spans six environments, requiring tool use and stepwise decision-making. Experiments show that current large multimodal models (LMMs) struggle with multi-step vision tasks, achieving less than 50% success, which points to room for improvement in LMM reasoning and tool use.

Keywords

Agent-X · benchmark · multimodal reasoning · vision-centric agents · MBZUAI

A new stress test for AI agents that plan, look and click

MBZUAI

MBZUAI researchers won second place at the AgentX Competition at UC Berkeley for their benchmark measuring AI agents' reasoning across images, comparisons, and video. The Agent-X dataset includes 828 tasks across six domains, requiring agents to use 14 executable tools without explicit instructions. Unlike typical evaluations that focus only on final answers, Agent-X analyzes the agent's full reasoning trajectory. Why it matters: The benchmark exposes limitations in current multimodal AI agents and provides a more rigorous evaluation framework for real-world applications in the region and beyond.

MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

arXiv · Oct 9

Researchers introduce MATRIX, a vision-centric agent tuning framework for robust tool-use reasoning in VLMs. The framework includes M-TRACE, a dataset of 28.5K multimodal tasks with 177K verified trajectories, and Pref-X, a set of 11K automatically generated preference pairs. Experiments show MATRIX consistently outperforms open- and closed-source VLMs across three benchmarks.
