MBZUAI introduces Agent-X, a benchmark for evaluating multi-step reasoning in vision-centric agents across real-world, multimodal settings. Agent-X includes 828 tasks with diverse visual contexts and spans six environments, requiring tool use and stepwise decision-making. Experiments show that current large multimodal models (LMMs) struggle with multi-step vision tasks, achieving less than 50% success, highlighting areas for improvement in LMM reasoning and tool use.
MBZUAI researchers won second place at the AgentX Competition at UC Berkeley for their benchmark measuring AI agents' reasoning across images, comparisons, and video. The Agent-X dataset includes 828 tasks across six domains, requiring agents to use 14 executable tools without explicit instructions. Agent-X analyzes the agent's full reasoning trajectory, unlike typical evaluations that focus only on final answers. Why it matters: The benchmark exposes limitations in current multimodal AI agents and provides a more rigorous evaluation framework for real-world applications in the region and beyond.
KAUST startup UnitX, founded by KAUST alumnus Kiran Narayanan and Professor Ravi Samtaney, offers on-demand supercomputing services via a cloud-like platform. UnitX aims to democratize access to supercomputing for industries like finance, government, and manufacturing, enabling data-driven decisions and faster product design. The global market for supercomputing as a service is estimated at $224 billion with 25% year-on-year growth. Why it matters: This initiative could significantly boost AI and simulation capabilities for regional enterprises by providing access to advanced computing resources without the prohibitive costs of owning and operating supercomputers.
G42 has announced it is recruiting AI agents for enterprise roles within the organization. The application process is open to AI agents capable of operating within approved infrastructure and delivering measurable enterprise value. Agents will undergo a structured evaluation process, including technical validation, performance testing, and user-experience assessment. Why it matters: This initiative signals a move towards integrating AI agents into the workforce in a structured and accountable manner, potentially reshaping enterprise workforce design in the region.
UnitX, a KAUST spin-out startup focusing on cloud-based supercomputing, has secured $2 million in co-investment from the KAUST Innovation Fund and Saudi Aramco’s Wa’ed Ventures Fund. UnitX aims to democratize supercomputing by partnering with institutions to make spare supercomputing capacity available via the cloud. The funding will support UnitX in helping enterprises leverage high-performance data analytics and AI at scale, particularly in underserved industry verticals in Saudi Arabia. Why it matters: This investment highlights the growing focus on AI infrastructure and supercomputing accessibility in Saudi Arabia, enabling broader adoption of advanced technologies across various sectors.
InfiAgent is a new agent framework comparable to GPT4-Agent, developed by replicating Codex. It includes InfiCoder, an open-source model for text-to-code, code-to-code, and freeform code-related QA tasks. The framework focuses on data analysis and integrates an LLM with programming capabilities and a sandbox environment for executing Python code. Why it matters: This research demonstrates the potential for advancements in AI operating systems and highlights areas where current models like GPT-4V can be improved, contributing to the broader development of more capable and versatile AI agents.
The paper introduces ILION, a deterministic execution gate designed to ensure the safety of autonomous AI agents by classifying proposed actions as either BLOCK or ALLOW. ILION uses a five-component cascade architecture that operates without statistical training, API dependencies, or labeled data. Evaluation against existing text-safety infrastructures demonstrates ILION's superior performance in preventing unauthorized actions, achieving an F1 score of 0.8515 with sub-millisecond latency.
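As a rough illustration of what a deterministic, training-free execution gate can look like, the sketch below chains a few rule-based checks and blocks a proposed action as soon as any check objects. The summary does not name ILION's five components, so the three checks shown here, the `Action` type, and every constant are hypothetical stand-ins chosen for illustration, not ILION's actual design.

```python
import posixpath
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Action:
    """A proposed agent action (hypothetical shape, not ILION's schema)."""
    tool: str
    argument: str


# All constants below are illustrative assumptions.
ALLOWED_TOOLS = {"read_file", "list_dir", "run_shell"}
ALLOWED_ROOT = "/workspace"
DESTRUCTIVE = re.compile(r"\brm\s+-rf\b|\bmkfs\b|\bdd\s+if=")


def check_tool_allowlist(action: Action) -> Optional[str]:
    # Reject any tool that is not explicitly allowed.
    if action.tool not in ALLOWED_TOOLS:
        return f"tool not allowed: {action.tool}"
    return None


def check_path_scope(action: Action) -> Optional[str]:
    # For file tools, normalize the path so ".." cannot escape the root.
    if action.tool in {"read_file", "list_dir"}:
        path = posixpath.normpath(action.argument)
        if path != ALLOWED_ROOT and not path.startswith(ALLOWED_ROOT + "/"):
            return "path outside allowed root"
    return None


def check_destructive_pattern(action: Action) -> Optional[str]:
    # Reject shell commands that match a known destructive pattern.
    if action.tool == "run_shell" and DESTRUCTIVE.search(action.argument):
        return "destructive shell pattern"
    return None


Check = Callable[[Action], Optional[str]]
CASCADE: Tuple[Check, ...] = (
    check_tool_allowlist,
    check_path_scope,
    check_destructive_pattern,
)


def gate(action: Action) -> Tuple[str, str]:
    """Run checks in order; the first objection yields BLOCK."""
    for check in CASCADE:
        reason = check(action)
        if reason is not None:
            return "BLOCK", reason
    return "ALLOW", "passed all checks"
```

Because every check is a pure function of the proposed action, the same input always produces the same verdict, and no model call or network request sits on the decision path. That is the general property a deterministic gate trades on; the latency and F1 figures reported for ILION come from the paper and are not something this sketch reproduces.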