GCC AI Research

How many queries does it take to break an AI? We put a number on it.

MBZUAI · Significant research

Summary

MBZUAI researchers presented a NeurIPS 2024 Spotlight paper that quantifies AI vulnerability by measuring the bits leaked per query. Their formula predicts the minimum number of queries an attack needs, based on the mutual information between the model's output and the attacker's target. Experiments across seven models and three attack types (system-prompt extraction, jailbreaks, and relearning) validate the relationship. Why it matters: This work offers a framework for translating UI choices (such as exposing log-probs or chain-of-thought) into concrete attack surfaces, informing more secure AI design and deployment in the region.
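
The idea behind the query bound can be illustrated with a toy calculation. This is not the paper's exact formula (which is stated in terms of mutual information between the model's output and the attacker's target); it is a minimal sketch assuming the simple ratio form "queries ≥ target entropy ÷ bits leaked per query", with illustrative function names:

```python
import math

def min_queries(target_entropy_bits: float, bits_per_query: float) -> int:
    """Information-theoretic lower bound on attack queries (illustrative).

    If each query leaks at most `bits_per_query` bits about a target
    with `target_entropy_bits` bits of entropy, an attacker needs at
    least ceil(H / I) queries to pin the target down.
    """
    return math.ceil(target_entropy_bits / bits_per_query)

# A hypothetical secret with 128 bits of entropy, in a UI that leaks
# roughly 0.5 bits per query, needs at least 256 queries.
print(min_queries(128, 0.5))
```

In this framing, a UI that exposes richer outputs (log-probs, chain-of-thought) raises the bits leaked per query and thus lowers the query floor, which is how such a framework turns interface choices into a measurable attack surface.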

Keywords

jailbreak · LLM · MBZUAI · security · vulnerability

Related

LLMEffiChecker: Understanding and Testing Efficiency Degradation of Large Language Models

arXiv ·

The paper introduces LLMEffiChecker, a tool to test the computational efficiency robustness of LLMs by identifying vulnerabilities that can significantly degrade performance. LLMEffiChecker uses both white-box (gradient-guided perturbation) and black-box (causal inference-based perturbation) methods to delay the generation of the end-of-sequence token. Experiments on nine public LLMs demonstrate that LLMEffiChecker can substantially increase response latency and energy consumption with minimal input perturbations.

How jailbreak attacks work and a new way to stop them

MBZUAI ·

Researchers at MBZUAI and other institutions have published a study at ACL 2024 investigating how jailbreak attacks work on LLMs. The study used a dataset of 30,000 prompts and non-linear probing to interpret the effects of jailbreak attacks, finding that existing interpretations were inadequate. The researchers propose a new approach to improving LLM safety against such attacks by identifying the network layers where the jailbreak behavior occurs. Why it matters: Understanding and mitigating jailbreak attacks is crucial for ensuring the responsible and secure deployment of LLMs, particularly in the Arabic-speaking world where these models are increasingly being used.

CAPTCHAs aren’t just annoying, they’re a reality check for AI agents

MBZUAI ·

MBZUAI researchers created Open CaptchaWorld, a new benchmark to test AI agents on solving CAPTCHAs. The benchmark includes 20 modern CAPTCHA types that require perception, reasoning, and interactive actions within a browser. While humans achieve 93.3% accuracy, the best AI agent only reaches 40% on the benchmark. Why it matters: This research highlights a critical gap in current AI agent capabilities, as CAPTCHAs are gatekeepers to high-value web actions like e-commerce and secure logins.

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

arXiv ·

Researchers from the National Center for AI in Saudi Arabia investigated the sensitivity of Large Language Model (LLM) leaderboards to minor benchmark perturbations. They found that small changes, like choice order, can shift rankings by up to 8 positions. The study recommends hybrid scoring and warns against over-reliance on simple benchmark evaluations, providing code for further research.
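
The "choice order" perturbation described above can be sketched as follows; the function and variable names are illustrative and not taken from the paper's released code:

```python
import random

def permute_choices(choices, answer_idx, seed=0):
    """Shuffle multiple-choice options while tracking the correct answer.

    Re-scoring a model on items perturbed this way, then comparing the
    resulting leaderboard against the original, is one way to probe how
    sensitive benchmark rankings are to choice order.
    """
    rng = random.Random(seed)  # seeded for reproducible perturbations
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    new_answer_idx = order.index(answer_idx)  # follow the answer through the shuffle
    return shuffled, new_answer_idx

shuffled, new_idx = permute_choices(["Paris", "Rome", "Oslo", "Cairo"], 0)
assert shuffled[new_idx] == "Paris"  # correct answer preserved under permutation
```

Because the underlying content is unchanged, any ranking shift observed after such a perturbation reflects evaluation brittleness rather than a real capability difference.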