MBZUAI introduces Agent-X, a benchmark for evaluating multi-step reasoning in vision-centric agents across real-world, multimodal settings. Agent-X comprises 828 tasks with diverse visual contexts spanning six environments, each requiring tool use and stepwise decision-making. Experiments show that current LMMs struggle with multi-step vision tasks, achieving less than 50% success, highlighting room for improvement in LMM reasoning and tool use.
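For readers new to agentic evaluation, the sketch below shows what such a stepwise, tool-using loop looks like in practice. It is a minimal, self-contained Python illustration: the tool registry, the scripted `propose_action` policy, and the stopping rule are hypothetical stand-ins, not Agent-X's actual harness or tool set.

```python
from typing import Callable

# Hypothetical executable tools an agent can choose among.
TOOLS: dict[str, Callable[[str], str]] = {
    "detect_objects": lambda arg: f"[boxes for '{arg}']",
    "ocr": lambda arg: f"[text read from '{arg}']",
    "answer": lambda arg: arg,  # terminal tool: emit the final answer
}

def propose_action(task: str, history: list[str]) -> tuple[str, str]:
    """Stand-in for the LMM policy: given the task and prior
    observations, pick the next tool and its argument.
    Scripted here purely for the demo."""
    if not history:
        return "detect_objects", task
    if len(history) == 1:
        return "ocr", history[-1]
    return "answer", history[-1]

def run_agent(task: str, max_steps: int = 5) -> list[str]:
    """Stepwise loop: decide, call a tool, observe, repeat until
    the agent picks the terminal 'answer' tool."""
    history: list[str] = []
    for _ in range(max_steps):
        tool, arg = propose_action(task, history)
        observation = TOOLS[tool](arg)
        history.append(observation)
        if tool == "answer":
            break
    return history

print(run_agent("What does the street sign say?"))
```

The point of the loop structure is that no tool sequence is given up front: the model must decide at every step which tool to invoke next based on what it has observed so far.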
MBZUAI researchers won second place at the AgentX Competition at UC Berkeley for their benchmark measuring AI agents' reasoning across images, multi-image comparisons, and video. The Agent-X dataset includes 828 tasks across six environments, requiring agents to select from 14 executable tools without explicit instructions. Unlike typical evaluations that score only final answers, Agent-X analyzes the agent's full reasoning trajectory. Why it matters: The benchmark exposes limitations in current multimodal AI agents and provides a more rigorous evaluation framework for real-world applications in the region and beyond.
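To make the trajectory-versus-answer distinction concrete, here is a minimal Python sketch of the two scoring styles. The `Step` fields, the per-step checks, and the equal weighting are illustrative assumptions and do not reproduce Agent-X's actual metrics.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One step of an agent's reasoning trace (illustrative fields)."""
    thought: str    # the agent's stated reasoning for this step
    tool: str       # which executable tool it invoked
    tool_ok: bool   # did the tool call succeed / match the reference?
    grounded: bool  # is the thought consistent with the visual input?

def final_answer_score(predicted: str, reference: str) -> float:
    """Typical evaluation: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0

def trajectory_score(steps: list[Step]) -> float:
    """Trajectory-level evaluation: credit each step for correct tool
    use and grounded reasoning, then average over the whole trace."""
    if not steps:
        return 0.0
    per_step = [0.5 * s.tool_ok + 0.5 * s.grounded for s in steps]
    return sum(per_step) / len(steps)

# An agent can reach the right answer via a flawed trace; trajectory
# scoring penalizes the ungrounded middle step that answer-only
# scoring hides.
trace = [
    Step("Crop the scoreboard region", "crop", tool_ok=True, grounded=True),
    Step("Read the score as 3-1", "ocr", tool_ok=True, grounded=False),
    Step("Report the final score", "answer", tool_ok=True, grounded=True),
]
print(final_answer_score("3-1", "3-1"))  # 1.0 -- looks perfect
print(trajectory_score(trace))           # ~0.83 -- flags the weak step
```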
Researchers introduce MATRIX, a vision-centric agent tuning framework for robust tool-use reasoning in VLMs. The framework includes M-TRACE, a dataset of 28.5K multimodal tasks with 177K verified trajectories, and Pref-X, a set of 11K automatically generated preference pairs. Experiments show MATRIX consistently outperforms open- and closed-source VLMs across three benchmarks.
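As background on how a set of preference pairs like Pref-X is typically consumed during tuning, here is a minimal DPO-style loss in PyTorch. Using DPO here is an assumption for illustration; MATRIX's actual preference-learning objective may differ, and all numbers below are made up.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over a batch of preference pairs: each pair holds the
    total log-probability of a preferred (chosen) and a dispreferred
    (rejected) trajectory under the policy and a frozen reference."""
    # How much more the policy prefers "chosen" than the reference
    # does, minus the same quantity for "rejected".
    margins = beta * ((policy_chosen_logp - ref_chosen_logp)
                      - (policy_rejected_logp - ref_rejected_logp))
    # Maximize the margin: minimize -log sigmoid(margin).
    return -F.logsigmoid(margins).mean()

# Toy batch of 3 preference pairs (log-probs are made-up numbers).
pol_c = torch.tensor([-12.0, -9.5, -15.2])   # policy logp, chosen traces
pol_r = torch.tensor([-14.1, -9.0, -18.0])   # policy logp, rejected traces
ref_c = torch.tensor([-13.0, -10.0, -15.0])  # reference logp, chosen
ref_r = torch.tensor([-13.5, -9.8, -17.5])   # reference logp, rejected
print(dpo_loss(pol_c, pol_r, ref_c, ref_r))  # prints a scalar loss tensor
```

The intuition is that the loss pushes the policy to rank the verified (chosen) trajectory above the flawed (rejected) one by a wider margin than the reference model does, which is how automatically generated pairs can refine tool-use reasoning without human step labels.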