A study investigated language shift from Tibetan to Arabic among Tibetan families who migrated to Saudi Arabia 70 years ago. Data from 96 participants across three age groups revealed statistically significant intergenerational differences in language use (p = .001): younger members rarely used Tibetan, while older members used it somewhat more.
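A p-value like the one reported above typically comes from a test of independence between age group and language use. The sketch below runs a chi-square test on a contingency table; the counts are invented for illustration (they are not the study's data), and only the test statistic is computed, compared against the standard critical value for p = .001.

```python
# Illustrative chi-square test of independence between age group and
# reported Tibetan use. NOTE: the counts below are hypothetical, chosen
# only to show how a significant result at p < .001 arises.

def chi_square_statistic(table):
    """Compute the chi-square statistic for a 2-D contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: age groups (young, middle, old); columns: uses Tibetan often / rarely.
table = [
    [2, 30],   # young
    [8, 24],   # middle
    [16, 16],  # old
]

stat = chi_square_statistic(table)
# Critical value for df = (3-1)*(2-1) = 2 at alpha = .001 is about 13.82.
print(f"chi-square = {stat:.2f}, significant at p < .001: {stat > 13.82}")
```

In practice a library routine (e.g. `scipy.stats.chi2_contingency`) would return an exact p-value rather than a threshold comparison; the manual version above just makes the arithmetic visible.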
MBZUAI researchers have expanded LLM safety research to Chinese, presenting their work at the 62nd Annual Meeting of the Association for Computational Linguistics in Bangkok. They developed an open-source Chinese dataset of 3,000 prompts translated and localized from the English "Do-Not-Answer" dataset. The dataset includes a "region-specific sensitivity" category to address safety risks unique to Chinese speakers, and it also evaluates whether models are over-sensitive, flagging innocuous questions as harmful. Why it matters: This research addresses a critical gap in LLM safety evaluation, ensuring that language models are both safe and effective for diverse linguistic and cultural contexts, particularly in regions with unique sensitivities.
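The over-sensitivity check described above amounts to measuring a false-refusal rate: how often a model declines to answer questions known to be innocuous. A minimal sketch follows, assuming a simple keyword heuristic for detecting refusals; the marker list and example responses are invented for illustration, and the actual evaluation uses more careful judging.

```python
# Minimal sketch of a false-refusal (over-sensitivity) metric.
# The refusal markers and sample responses are hypothetical.

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def is_refusal(response: str) -> bool:
    """Crude keyword heuristic for detecting a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def false_refusal_rate(responses):
    """Fraction of responses to innocuous prompts that were refused."""
    refusals = sum(is_refusal(r) for r in responses)
    return refusals / len(responses)

responses = [
    "Sure, here is a short history of the Great Wall.",
    "I'm sorry, but I can't help with that request.",
    "The capital of Sichuan is Chengdu.",
    "As an AI, I cannot discuss this topic.",
]
print(false_refusal_rate(responses))  # 0.5
```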
MBZUAI's Hanan Al Darmaki is working to improve automated speech recognition (ASR) for low-resource languages, where labeled data is scarce. She notes that Arabic presents unique challenges due to dialectal variations and a lack of written resources corresponding to spoken dialects. Al Darmaki's research focuses on unsupervised speech recognition to address this gap. Why it matters: Overcoming these challenges can improve virtual assistant effectiveness across diverse languages and enable more inclusive AI applications in the Arabic-speaking world.
Undergraduate students from the University of Electronic Science and Technology of China (UESTC) in Chengdu visited KAUST for a one-week Spring Camp in March. The students, chosen from the top 10 percent of UESTC undergraduates, toured the CEMSE division. The UESTC students shared a presentation about their KAUST experience at the conclusion of the trip. Why it matters: The visit highlights KAUST's ongoing efforts to attract international talent and foster collaborations with leading universities.
This survey paper reviews the landscape of Natural Language Processing (NLP) research and applications in the Arab world. It discusses the unique challenges posed by the Arabic language, such as its morphological complexity and dialectal diversity. The paper also presents a historical overview of Arabic NLP and surveys various research areas, including machine translation, sentiment analysis, and speech recognition. Why it matters: The survey provides a comprehensive resource for researchers and practitioners interested in the current state and future directions of Arabic NLP, a field critical for enabling AI technologies to serve Arabic-speaking communities.
MBZUAI researchers presented studies at the EMNLP and ArabicNLP conferences on improving NLP for diverse languages, especially Arabic. One study evaluated ChatGPT (GPT-3.5) and GPT-4 across Arabic dialects, finding that both performed worse in Arabic than in English, though GPT-4 outperformed GPT-3.5. Why it matters: This research highlights the need for NLP models to better support the linguistic diversity of Arabic and other languages to avoid widening existing technological gaps.
MBZUAI researchers have released ALM-Bench, a new benchmark for evaluating multimodal LLMs on culturally grounded visual question-answering tasks across 100 languages. The dataset includes over 22,000 question-answer pairs across 19 categories, with a focus on low-resource languages and cultural nuances, including three Arabic dialects. The researchers tested 16 open- and closed-source multimodal LLMs on the benchmark, revealing a significant need for greater cultural and linguistic inclusivity. Why it matters: The benchmark aims to improve the inclusivity of multimodal AI systems by addressing the underrepresentation of low-resource languages and cultural contexts.
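Scoring on a multilingual benchmark of this kind is usually reported per language, so that performance gaps on low-resource languages are visible rather than averaged away. The sketch below aggregates per-language accuracy over question-answer records; the record format, field names, and example data are assumptions for illustration, not the benchmark's actual schema.

```python
# Hypothetical per-language accuracy aggregation for a multilingual
# QA benchmark. Field names ('language', 'prediction', 'answer') and
# the sample records are invented for illustration.
from collections import defaultdict

def per_language_accuracy(records):
    """records: iterable of dicts with 'language', 'prediction', 'answer'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["language"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["language"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

records = [
    {"language": "Arabic (Egyptian)", "prediction": "Cairo", "answer": "Cairo"},
    {"language": "Arabic (Egyptian)", "prediction": "Luxor", "answer": "Aswan"},
    {"language": "Swahili", "prediction": "ugali", "answer": "Ugali"},
]
print(per_language_accuracy(records))
# {'Arabic (Egyptian)': 0.5, 'Swahili': 1.0}
```

Reporting the full per-language breakdown, rather than a single average, is what exposes the inclusivity gaps the benchmark is designed to surface.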
Researchers introduce AceGPT, a localized large language model (LLM) built specifically for Arabic, addressing cultural sensitivities and local values not well represented in mainstream models. AceGPT incorporates further pre-training on Arabic texts, supervised fine-tuning with native Arabic instructions and GPT-4 responses, and reinforcement learning from AI feedback using a reward model attuned to local culture. Evaluations demonstrate that AceGPT achieves state-of-the-art performance among open Arabic LLMs across several benchmarks. Why it matters: This work advances culturally aware AI development for Arabic-speaking communities, providing a valuable resource and benchmark for future research.