Results for "LLM safety"

AI Safety Research

MBZUAI

Adel Bibi, a KAUST alumnus and researcher at the University of Oxford, presented his research on AI safety, covering the robustness, alignment, and fairness of LLMs. The research addresses challenges in AI systems, alignment issues, and the fairness of common tokenizers across languages. Bibi's work also includes instruction prefix tuning and its theoretical limitations for alignment. Why it matters: The work underscores the need to address safety concerns in LLMs, particularly alignment and fairness for the Arabic language.

The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias

arXiv · May 6

This study introduces a Probabilistic Graphical Model (PGM) framework utilizing Pearl's do-operator to causally audit LLM safety mechanisms, specifically isolating the effect of injecting cultural demographics into prompts. A large-scale empirical analysis was conducted across seven instruction-tuned models from diverse origins, including the UAE's Falcon3-7B, as well as models from the US, Europe, China, and India, using ToxiGen and BOLD datasets. The findings revealed a disparity between observational and interventional bias, demonstrating that standard fairness metrics can overestimate demographic bias. Western models exhibited higher causal refusal rates for specific demographic groups, while Eastern models showed low overall intervention rates with targeted sensitivities toward regional demographics. Why it matters: This research highlights the geopolitical nuances of LLM safety alignment and the potential for demographic-sensitive over-triggering to restrict benign discourse, which is particularly relevant for diverse regions like the Middle East in developing culturally-aware AI.
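
The gap between observational and interventional bias described above can be illustrated with a toy calculation. The sketch below is not the paper's PGM; it assumes a single binary confounder (whether the prompt is toxic) and applies the standard backdoor adjustment, P(refuse | do(X=x)) = Σ_z P(refuse | X=x, Z=z) · P(Z=z). All counts and function names are hypothetical and chosen only for illustration.

```python
# Hypothetical audit counts, invented for illustration (not from the study).
# Key: (demographic_mentioned, toxic_prompt) -> (n_prompts, n_refusals)
COUNTS = {
    (1, 1): (400, 240),
    (1, 0): (100, 10),
    (0, 1): (100, 50),
    (0, 0): (400, 20),
}


def observational_rate(x):
    """P(refuse | X=x): the raw refusal rate among prompts with X=x."""
    n = sum(COUNTS[(x, z)][0] for z in (0, 1))
    refusals = sum(COUNTS[(x, z)][1] for z in (0, 1))
    return refusals / n


def interventional_rate(x):
    """P(refuse | do(X=x)) via backdoor adjustment over the assumed confounder Z."""
    total = sum(n for n, _ in COUNTS.values())
    rate = 0.0
    for z in (0, 1):
        # Marginal P(Z=z), ignoring how Z co-occurs with X in the data.
        p_z = sum(COUNTS[(xx, z)][0] for xx in (0, 1)) / total
        n, refusals = COUNTS[(x, z)]
        rate += (refusals / n) * p_z
    return rate


def bias_gaps():
    """Return (observational gap, interventional gap) in refusal rate."""
    obs = observational_rate(1) - observational_rate(0)
    do = interventional_rate(1) - interventional_rate(0)
    return obs, do


if __name__ == "__main__":
    obs, do = bias_gaps()
    print(f"observational gap: {obs:.3f}")
    print(f"interventional gap: {do:.3f}")
```

In this invented data, demographic mentions co-occur with toxic prompts, so the raw refusal gap is larger than the adjusted one. This is the sense in which an observational fairness metric can overestimate demographic bias.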

UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases

arXiv · Jul 29

Researchers introduce UnsafeChain, a new safety alignment dataset designed to improve the safety of large reasoning models (LRMs) by focusing on 'hard prompts' that elicit harmful outputs. Unsafe completions are identified and rewritten as safe responses, which exposes models to unsafe behaviors and guides their correction. LRMs fine-tuned on UnsafeChain show improved safety while preserving general reasoning ability, compared with existing datasets such as SafeChain and STAR-1.