Results for "agent tuning"

Fast Rates for Maximum Entropy Exploration

MBZUAI

This paper addresses exploration in reinforcement learning (RL) in unknown environments with sparse rewards, focusing on maximum entropy exploration. It introduces a game-theoretic algorithm for visitation entropy maximization with improved sample complexity of O(H^3S^2A/ε^2). For trajectory entropy, the paper presents an algorithm with O(poly(S, A, H)/ε) complexity, showing the statistical advantage of regularized MDPs for exploration. Why it matters: The research offers new techniques to reduce the sample complexity of RL, potentially enhancing the efficiency of AI agents in complex environments.
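
As a rough illustration of the quantity being maximized, the sketch below computes the Shannon entropy of an empirical state-action visitation distribution and evaluates the H^3S^2A/ε^2 term from the summary. It is a minimal sketch under stated assumptions, not the paper's algorithm: the function names, the trajectory format (lists of (state, action) pairs), and the omission of constants and logarithmic factors are all illustrative choices.

```python
import math
from collections import Counter


def visitation_entropy(trajectories):
    """Shannon entropy (in nats) of the empirical state-action visitation distribution.

    `trajectories` is a list of trajectories, each a list of hashable
    (state, action) pairs. This input format is an assumption for illustration.
    """
    counts = Counter(pair for traj in trajectories for pair in traj)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def bound_scale(H, S, A, eps):
    """Evaluate H^3 * S^2 * A / eps^2, ignoring constants and log factors."""
    return H**3 * S**2 * A / eps**2


# A policy that spreads visits evenly over four state-action pairs
# has higher visitation entropy than one that keeps revisiting a single pair.
uniform = [[(0, "a"), (0, "b"), (1, "a"), (1, "b")]]
stuck = [[(0, "a"), (0, "a"), (0, "a"), (0, "a")]]
print(visitation_entropy(uniform), visitation_entropy(stuck))
```

Halving ε multiplies the 1/ε^2 term by four but a 1/ε term by only two, which is the sense in which the trajectory-entropy rate quoted above is faster.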

CAPTCHAs aren’t just annoying, they’re a reality check for AI agents

MBZUAI

MBZUAI researchers created Open CaptchaWorld, a new benchmark that tests AI agents on solving CAPTCHAs. The benchmark includes 20 modern CAPTCHA types that require perception, reasoning, and interactive actions within a browser. While humans achieve 93.3% accuracy, the best AI agent reaches only 40%. Why it matters: This research highlights a critical gap in current AI agent capabilities, as CAPTCHAs are gatekeepers to high-value web actions like e-commerce and secure logins.

Multi-agent Time-based Decision-making for the Search and Action Problem

arXiv · Feb 27

This paper introduces a decentralized multi-agent decision-making framework for search and action problems under time constraints, treating time as a budgeted resource where actions have costs and rewards. The approach uses probabilistic reasoning to optimize decisions, maximizing reward within the given time. Evaluated in a simulated search, pick, and place scenario inspired by the Mohamed Bin Zayed International Robotics Challenge (MBZIRC), the algorithm outperformed benchmark strategies. Why it matters: The framework's validation in a Gazebo environment signals potential for real-world robotic applications, particularly in time-sensitive and cooperative tasks within the robotics domain in the UAE.

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv · Dec 18

A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length and multimodal richness, using human-validated samples. LongShOTAgent, an agentic system for analyzing long videos, is also presented. Together, the benchmark and the agent demonstrate the challenges that state-of-the-art multimodal large language models (MLLMs) face.