Skip to content
GCC AI Research

Search

Results for "cultural alignment"

What LLMs get wrong about culture — and how to fix them: Two studies from NAACL

MBZUAI ·

MBZUAI researchers presented two studies at NAACL 2025 concerning how LLMs understand cultural differences, with one study winning the SAC award. One study, titled "Reading between the lines: Can LLMs identify cross-cultural communication gaps," assesses GPT-4o's ability to identify cultural references in Goodreads book reviews. The researchers created a benchmark dataset using annotations from 50 evaluators across different cultures to measure the LLM's ability to identify culture-specific items (CSIs). Why it matters: Improving LLMs' cross-cultural understanding is crucial for ensuring these models can be used effectively and equitably across diverse global contexts.

Advancing cultural diversity through AI

MBZUAI ·

MBZUAI is conducting research to improve cross-cultural understanding using AI, including studying LLM limitations in recognizing cultural references. They developed "Culturally Yours," a tool that helps users comprehend cultural references in text, and the "All Languages Matter Benchmark" (ALM Bench) to evaluate multimodal LLMs across 100 languages. MBZUAI has also developed LLMs tailored to low-resource languages like Jais (Arabic), Nanda (Hindi), and Sherkala (Kazakh). Why it matters: These initiatives promote inclusivity and ensure AI systems are culturally aware and can serve diverse populations effectively, particularly in the Middle East's multicultural context.

FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models

arXiv ·

The paper introduces FanarGuard, a bilingual moderation filter for Arabic and English language models that considers both safety and cultural alignment. A dataset of 468K prompt-response pairs was created and scored by LLM judges on harmlessness and cultural awareness to train the filter. The first benchmark targeting Arabic cultural contexts was developed to evaluate cultural alignment. Why it matters: FanarGuard advances context-sensitive AI safeguards by integrating cultural awareness into content moderation, addressing a critical gap in current alignment techniques.

SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia

arXiv ·

The paper introduces SaudiCulture, a new benchmark for evaluating the cultural competence of LLMs within Saudi Arabia, covering five major geographical regions and diverse cultural domains. The benchmark includes questions of varying complexity and distinguishes between common and specialized regional knowledge. Evaluations of five LLMs (GPT-4, Llama 3.3, FANAR, Jais, and AceGPT) revealed performance declines on region-specific questions, highlighting the need for region-specific knowledge in LLM training.

Culture and bias in LLMs: Defining the challenge and mitigating risks

MBZUAI ·

Researchers from MBZUAI, University of Washington, and other institutions presented studies at EMNLP 2024 exploring how LLMs represent cultures. A survey analyzed dozens of recent studies on LLMs and culture and proposes a new framework for future research. The survey found that there is no widely accepted definition of 'culture' in NLP, making it challenging to interpret how models represent culture through language. Why it matters: This highlights a key gap in the field and emphasizes the need for a more rigorous and consistent understanding of culture in AI, especially as LLMs become more globally integrated.

Why AI can describe an image but struggles to understand the culture inside it

MBZUAI ·

A new paper from MBZUAI introduces JEEM, a benchmark dataset for evaluating vision-language models on their understanding of images grounded in four Arabic-speaking societies (Jordan, UAE, Egypt, and Morocco) and their ability to use local dialects. The dataset comprises 2,178 images and 10,890 question-answer pairs reflecting everyday life and culturally specific scenes. Evaluation of several Arabic-capable models (Maya, PALO, Peacock, AIN, AyaV) and GPT-4o revealed that while models can generate fluent language, they struggle with genuine understanding, consistency, and relevance, especially when cultural context is important. Why it matters: This research highlights the challenges of building AI systems that can truly understand and interact with diverse cultures, emphasizing the need for culturally grounded datasets and evaluation metrics.