The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias
arXiv ·
This study introduces a Probabilistic Graphical Model (PGM) framework utilizing Pearl's do-operator to causally audit LLM safety mechanisms, specifically isolating the effect of injecting cultural demographics into prompts. A large-scale empirical analysis was conducted across seven instruction-tuned models from diverse origins, including the UAE's Falcon3-7B, as well as models from the US, Europe, China, and India, using ToxiGen and BOLD datasets. The findings revealed a disparity between observational and interventional bias, demonstrating that standard fairness metrics can overestimate demographic bias. Western models exhibited higher causal refusal rates for specific demographic groups, while Eastern models showed low overall intervention rates with targeted sensitivities toward regional demographics. Why it matters: This research highlights the geopolitical nuances of LLM safety alignment and the potential for demographic-sensitive over-triggering to restrict benign discourse, which is particularly relevant for diverse regions like the Middle East in developing culturally-aware AI.