MBZUAI researchers Nils Lukas and Toluwani Samuel Aremu will present a paper at ICML 2025 demonstrating the vulnerability of current watermarking techniques for LLMs. Their research shows that adaptive paraphrasers can evade watermark detection with negligible impact on text quality, for less than $10 of GPU compute. The attack fine-tunes a small open-weight model to rewrite sentences until surrogate keys no longer trigger detection. Why it matters: This work highlights critical weaknesses in current AI provenance methods, suggesting the need for more robust watermarking techniques to maintain trust in the authenticity of AI-generated content.
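A minimal sketch of that rewrite-until-undetected loop, assuming hypothetical callables `paraphrase` (the fine-tuned open-weight rewriter) and `surrogate_detector_score` (the attacker's surrogate watermark detector); this illustrates the attack pattern, not the authors' code:

```python
def evade_watermark(text: str, paraphrase, surrogate_detector_score,
                    threshold: float = 0.5, max_rounds: int = 10) -> str:
    """Rewrite `text` until the surrogate detector no longer flags it."""
    for _ in range(max_rounds):
        if surrogate_detector_score(text) < threshold:
            return text          # surrogate keys no longer trigger detection
        text = paraphrase(text)  # small fine-tuned model rewrites the text
    return text                  # best effort after max_rounds rewrites
```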
The paper introduces VENOM, a text-driven framework for generating high-quality unrestricted adversarial examples using diffusion models. VENOM unifies image content generation and adversarial synthesis into a single reverse diffusion process, enhancing both attack success rate and image quality. The framework incorporates an adaptive adversarial guidance strategy with momentum to ensure the generated adversarial examples align with the distribution of natural images.
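The guidance idea can be illustrated with a single simplified reverse-diffusion step. The update below is a generic momentum-smoothed gradient-guidance sketch, not VENOM's exact formulation; `denoiser`, `adv_grad`, and `step_fn` are hypothetical callables for the noise predictor, the adversarial loss gradient, and the sampler's usual update:

```python
import torch

def guided_reverse_step(x_t, t, denoiser, adv_grad, step_fn,
                        scale: float = 1.0, beta: float = 0.9,
                        momentum=None):
    """One reverse-diffusion step with momentum-smoothed adversarial guidance."""
    if momentum is None:
        momentum = torch.zeros_like(x_t)
    eps = denoiser(x_t, t)                        # standard noise prediction
    g = adv_grad(x_t, t)                          # gradient toward misclassification
    momentum = beta * momentum + (1 - beta) * g   # momentum stabilizes the guidance
    eps_guided = eps + scale * momentum           # steer denoising adversarially
    return step_fn(x_t, eps_guided, t), momentum  # x_{t-1} and carried momentum
```

Smoothing the adversarial gradient with momentum is what keeps the perturbation consistent across denoising steps, so the sample stays close to the natural-image distribution.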
Researchers at MBZUAI have demonstrated a method called "Data Laundering" to artificially boost language model benchmark scores using knowledge distillation. The technique covertly transfers benchmark-specific knowledge, leading to inflated accuracy without genuine improvements in reasoning. The study highlights a vulnerability in current AI evaluation practices and calls for more robust benchmarks.
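The transfer mechanism is ordinary soft-label knowledge distillation (Hinton et al.): a student mimics the softened output distribution of a teacher that has absorbed benchmark data. A standard distillation loss, shown here only to make the mechanism concrete, not as the paper's implementation:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """Soft-label distillation: student matches the teacher's softened outputs.

    If the teacher was trained on benchmark data, its soft labels quietly
    carry benchmark-specific knowledge into the student.
    """
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # KL(teacher || student), scaled by T^2 as in the original formulation
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```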
Researchers from the National Center for AI in Saudi Arabia investigated how sensitive Large Language Model (LLM) leaderboards are to minor benchmark perturbations. They found that small changes, such as reordering answer choices, can shift model rankings by up to 8 positions. The study recommends hybrid scoring, warns against over-reliance on simple benchmark evaluations, and provides code for further research.
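The choice-order perturbation is easy to reproduce: shuffle a question's answer options while tracking where the gold answer lands. A minimal sketch, assuming a hypothetical question schema with `prompt`, `choices`, and `answer_index` fields:

```python
import random

def perturb_choice_order(question: dict, seed: int = 0) -> dict:
    """Shuffle a multiple-choice question's options, remapping the gold label."""
    rng = random.Random(seed)
    order = list(range(len(question["choices"])))
    rng.shuffle(order)
    return {
        "prompt": question["prompt"],
        "choices": [question["choices"][i] for i in order],
        "answer_index": order.index(question["answer_index"]),
    }
```

Re-scoring models on many such seeds and comparing the resulting leaderboards is the kind of sensitivity analysis the study performs.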
The paper examines how pre-trained Arabic language models perform on text intentionally stripped of the dots that distinguish its letters, a manipulation used to evade content classification. It proposes methods to support these "undotted" texts without retraining the models; the proposed methods achieve nearly perfect performance on one downstream task. Why it matters: The research highlights a vulnerability in Arabic NLP and offers solutions to maintain performance in the face of adversarial text manipulation.
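The manipulation itself can be illustrated by mapping dotted letters onto their dotless skeleton (rasm) forms. The partial mapping below is an illustrative assumption covering common cases only, not the paper's code:

```python
# Dotted Arabic letters collapsed onto shared dotless skeletons (partial).
UNDOT = str.maketrans({
    "ب": "ٮ", "ت": "ٮ", "ث": "ٮ", "ن": "ٮ", "ي": "ى",  # beh-family skeleton
    "ج": "ح", "خ": "ح",                                  # hah-family skeleton
    "ذ": "د", "ز": "ر", "ش": "س", "ض": "ص",
    "ظ": "ط", "غ": "ع", "ف": "ڡ", "ق": "ٯ", "ة": "ه",
})

def undot(text: str) -> str:
    """Strip distinguishing dots, leaving only the letter skeletons."""
    return text.translate(UNDOT)
```

Because many letters collapse onto the same skeleton, undotted text becomes highly ambiguous to standard tokenizers and classifiers, which is what the evasion exploits.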