GCC AI Research

CAPTCHAs aren’t just annoying; they’re a reality check for AI agents

MBZUAI · Significant research

Summary

MBZUAI researchers created Open CaptchaWorld, a new benchmark that tests AI agents on solving CAPTCHAs. The benchmark includes 20 modern CAPTCHA types that require perception, reasoning, and interactive actions within a browser. While humans achieve 93.3% accuracy, the best AI agent reaches only 40% on the benchmark. Why it matters: This research highlights a critical gap in current AI agent capabilities, as CAPTCHAs are gatekeepers to high-value web actions like e-commerce and secure logins.
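A benchmark like this boils down to aggregating per-attempt pass/fail outcomes into an overall solve rate and a breakdown by CAPTCHA type. Here is a minimal, hypothetical sketch of that aggregation; the function name, the CAPTCHA-type labels, and the demo numbers are illustrative assumptions, not data from the paper.

```python
from collections import defaultdict

def solve_rates(results):
    """results: list of (captcha_type, solved) pairs.
    Returns (overall solve rate, per-type solve rates)."""
    per_type = defaultdict(lambda: [0, 0])  # type -> [solved, total]
    for ctype, solved in results:
        per_type[ctype][0] += int(solved)
        per_type[ctype][1] += 1
    total_solved = sum(s for s, _ in per_type.values())
    total = sum(t for _, t in per_type.values())
    return total_solved / total, {k: s / t for k, (s, t) in per_type.items()}

# Illustrative attempts only (not the paper's results):
demo = [("slider", True), ("slider", False),
        ("grid_select", True), ("rotate", False)]
overall, by_type = solve_rates(demo)  # overall = 0.5
```

Reporting per-type rates alongside the aggregate is what lets a benchmark show *which* interaction styles (dragging, selecting, rotating) trip agents up, rather than a single opaque score.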


Related

A new stress test for AI agents that plan, look and click

MBZUAI ·

MBZUAI researchers won second place at the AgentX Competition at UC Berkeley for their benchmark measuring AI agents' reasoning across images, comparisons, and video. The Agent-X dataset includes 828 tasks across six domains, requiring agents to use 14 executable tools without explicit instructions. Agent-X analyzes the agent's full reasoning trajectory, unlike typical evaluations that focus only on final answers. Why it matters: The benchmark exposes limitations in current multimodal AI agents and provides a more rigorous evaluation framework for real-world applications in the region and beyond.
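The contrast between trajectory-level evaluation and final-answer-only evaluation can be sketched in a few lines. This is a hypothetical simplification, not Agent-X's actual scoring protocol: here each intermediate tool call is matched exactly against a gold trajectory and weighted equally with the final answer.

```python
def final_answer_score(predicted, gold):
    """Typical evaluation: all-or-nothing on the final answer."""
    return 1.0 if predicted == gold else 0.0

def trajectory_score(steps, gold_steps, predicted, gold):
    """Credit correct intermediate tool calls as well as the final answer."""
    matched = sum(1 for s, g in zip(steps, gold_steps) if s == g)
    step_score = matched / max(len(gold_steps), 1)
    return 0.5 * step_score + 0.5 * final_answer_score(predicted, gold)

# An agent with a mostly correct trajectory but a wrong final answer
# still earns partial credit under trajectory scoring:
steps = ["detect_objects", "crop", "ocr"]
gold_steps = ["detect_objects", "crop", "compare"]
score = trajectory_score(steps, gold_steps, "cat", "dog")  # 1/3 vs 0.0
```

The design point is that final-answer scoring cannot distinguish an agent that reasoned well but stumbled at the last step from one that was lost throughout, which is exactly the gap trajectory analysis closes.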

ILION: Deterministic Pre-Execution Safety Gates for Agentic AI Systems

arXiv ·

The paper introduces ILION, a deterministic execution gate designed to ensure the safety of autonomous AI agents by classifying proposed actions as either BLOCK or ALLOW. ILION uses a five-component cascade architecture that operates without statistical training, API dependencies, or labeled data. Evaluation against existing text-safety infrastructures demonstrates ILION's superior performance in preventing unauthorized actions, achieving an F1 score of 0.8515 with sub-millisecond latency.
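A deterministic gate of this kind is, at its core, an ordered cascade of rule-based checks that short-circuits to BLOCK on the first match. The sketch below is a hypothetical illustration in that spirit; the three checks, the blocklists, and the action schema are invented for the example and are not ILION's actual five components.

```python
# Illustrative rule data, not from the paper:
BLOCKED_COMMANDS = {"rm", "dd", "mkfs"}
PROTECTED_PATHS = ("/etc/", "/boot/")

def check_command(action):
    """Fires on commands that are never allowed."""
    return action.get("command") in BLOCKED_COMMANDS

def check_path(action):
    """Fires on writes that touch protected filesystem prefixes."""
    return any(action.get("target", "").startswith(p) for p in PROTECTED_PATHS)

def check_network(action):
    """Fires on calls to internal-only hosts."""
    return action.get("host", "").endswith(".internal")

# Ordered, purely rule-based cascade: no model, no training data.
CASCADE = [check_command, check_path, check_network]

def gate(action):
    """Classify a proposed agent action as BLOCK or ALLOW."""
    for check in CASCADE:
        if check(action):
            return "BLOCK"
    return "ALLOW"
```

Because every check is a deterministic predicate, the gate's latency is bounded by a handful of string operations, which is consistent with the sub-millisecond figure the paper reports for its own architecture.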

A next step for embodied agents: Ivan Laptev on world models

MBZUAI ·

MBZUAI Professor Ivan Laptev is working to bridge the gap between data-driven AI systems and embodied agents (robots). He notes challenges in robotics including data scarcity, the need to generate new data through actions, and the requirement for real-time operation. Laptev aims to transfer innovations from computer vision to robotics, addressing these challenges to improve robots' ability to interpret and respond to the complexities of the real world. Why it matters: Overcoming these hurdles is crucial for advancing robotics and enabling robots to effectively interact with and navigate dynamic real-world environments.

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

arXiv ·

MBZUAI introduces Agent-X, a benchmark for evaluating multi-step reasoning in vision-centric agents across real-world, multimodal settings. Agent-X includes 828 tasks with diverse visual contexts and spans six environments, requiring tool use and stepwise decision-making. Experiments show that current large multimodal models (LMMs) struggle with multi-step vision tasks, achieving less than 50% success, highlighting areas for improvement in LMM reasoning and tool use.