Walking the line: Safety and performance in large language models

MBZUAI · Notable

Summary

MBZUAI researchers have expanded LLM safety research to Chinese, presenting their work at the 62nd Annual Meeting of the Association for Computational Linguistics in Bangkok. They developed an open-source Chinese dataset of 3,000 prompts translated and localized from the English "Do-Not-Answer" dataset. The dataset includes a "region-specific sensitivity" category to address unique safety risks for Chinese speakers, evaluating if models are over-sensitive in identifying innocuous questions as harmful. Why it matters: This research addresses a critical gap in LLM safety evaluation, ensuring that language models are both safe and effective for diverse linguistic and cultural contexts, particularly in regions with unique sensitivities.

Keywords

LLM safety · Chinese language · MBZUAI · Do-Not-Answer dataset · region-specific sensitivity

Read original article →

Get the weekly digest

Top AI stories from the GCC region, every week.

AI Safety Research

MBZUAI · Invalid Date

Adel Bibi, a KAUST alumnus and researcher at the University of Oxford, presented his research on AI safety, covering robustness, alignment, and fairness of LLMs. The research addresses challenges in AI systems, alignment issues, and fairness across languages in common tokenizers. Bibi's work includes instruction prefix tuning and its theoretical limitations towards alignment. Why it matters: This research from a leading researcher highlights the importance of addressing safety concerns in LLMs, particularly regarding alignment and fairness in the Arabic language.

UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases

arXiv · Jul 29

Researchers introduce UnsafeChain, a new safety alignment dataset designed to improve the safety of large reasoning models (LRMs) by focusing on 'hard prompts' that elicit harmful outputs. The dataset identifies and corrects unsafe completions into safe responses, exposing models to unsafe behaviors and guiding their correction. Fine-tuning LRMs on UnsafeChain demonstrates enhanced safety and preservation of general reasoning ability compared to existing datasets like SafeChain and STAR-1.

LLM Post-Training: A Deep Dive into Reasoning Large Language Models

arXiv · Feb 28

A new survey paper provides a deep dive into post-training methodologies for Large Language Models (LLMs), analyzing their role in refining LLMs beyond pretraining. It addresses key challenges such as catastrophic forgetting, reward hacking, and inference-time trade-offs, and highlights emerging directions in model alignment, scalable adaptation, and inference-time reasoning. The paper also provides a public repository to continually track developments in this fast-evolving field.

LLMEffiChecker: Understanding and Testing Efficiency Degradation of Large Language Models

arXiv · Oct 7

The paper introduces LLMEffiChecker, a tool to test the computational efficiency robustness of LLMs by identifying vulnerabilities that can significantly degrade performance. LLMEffiChecker uses both white-box (gradient-guided perturbation) and black-box (causal inference-based perturbation) methods to delay the generation of the end-of-sequence token. Experiments on nine public LLMs demonstrate that LLMEffiChecker can substantially increase response latency and energy consumption with minimal input perturbations.