How jailbreak attacks work and a new way to stop them

MBZUAI · Significant research

Summary

Researchers at MBZUAI and other institutions have published a study at ACL 2024 investigating how jailbreak attacks work on LLMs. The study used a dataset of 30,000 prompts and non-linear probing to interpret the effects of jailbreak attacks, finding that existing interpretations were inadequate. The researchers propose a new approach to improve LLM safety against such attacks by identifying the layers in neural networks where the behavior occurs. Why it matters: Understanding and mitigating jailbreak attacks is crucial for ensuring the responsible and secure deployment of LLMs, particularly in the Arabic-speaking world where these models are increasingly being used.

Keywords

jailbreak attacks · LLMs · MBZUAI · ACL · interpretability

Read original article →

Get the weekly digest

Top AI stories from the GCC region, every week.

Your voice can jailbreak a speech model – here’s how to stop it, without retraining

MBZUAI · Invalid Date

A new paper from MBZUAI demonstrates that state-of-the-art speech models can be easily jailbroken using audio perturbations to generate harmful content, achieving success rates of 76-93% on models like Qwen2-Audio and LLaMA-Omni. The researchers adapted projected gradient descent (PGD) to the audio domain to optimize waveforms that push the model towards harmful responses. They propose a defense mechanism based on post-hoc activation patching that hardens models at inference time without retraining. Why it matters: This research highlights a critical vulnerability in speech-based LLMs and offers a practical solution, contributing to the development of more secure and trustworthy AI systems in the region and globally.

Formal Methods for Modern Payment Protocols

MBZUAI · Invalid Date

Researchers at ETH Zurich have formalized models of the EMV payment protocol using the Tamarin model checker. They discovered flaws allowing attackers to bypass PIN requirements for high-value purchases on EMV cards like Mastercard and Visa. The team also collaborated with an EMV consortium member to verify the improved EMV Kernel C-8 protocol. Why it matters: This research highlights the importance of formal methods in identifying critical vulnerabilities in widely used payment systems, potentially impacting financial security for consumers in the GCC region and worldwide.

CRC Seminar Series - Cristofaro Mune, Niek Timmers

TII · Mar 17

Cristofaro Mune and Niek Timmers presented a seminar on bypassing unbreakable crypto using fault injection on Espressif ESP32 chips. The presentation detailed how the hardware-based Encrypted Secure Boot implementation of the ESP32 SoC was bypassed using a single EM glitch, without knowing the decryption key. This attack exploited multiple hardware vulnerabilities, enabling arbitrary code execution and extraction of plain-text data from external flash. Why it matters: The research highlights critical security vulnerabilities in embedded systems and the potential for fault injection attacks to bypass secure boot mechanisms, necessitating stronger hardware-level security measures.

How many queries does it take to break an AI? We put a number on it.

MBZUAI · Invalid Date

MBZUAI researchers presented a NeurIPS 2024 Spotlight paper that quantifies AI vulnerability by measuring bits leaked per query. Their formula predicts the minimum queries needed for attacks based on mutual information between model output and attacker's target. Experiments across seven models and three attack types (system-prompt extraction, jailbreaks, relearning) validate the relationship. Why it matters: This work offers a framework to translate UI choices (like exposing log-probs or chain-of-thought) into concrete attack surfaces, informing more secure AI design and deployment in the region.

How jailbreak attacks work and a new way to stop them

Summary

Keywords

Related

Your voice can jailbreak a speech model – here’s how to stop it, without retraining

Formal Methods for Modern Payment Protocols

CRC Seminar Series - Cristofaro Mune, Niek Timmers

How many queries does it take to break an AI? We put a number on it.