A study investigated language shift from Tibetan to Arabic among Tibetan families who migrated to Saudi Arabia 70 years ago. Data from 96 participants across three age groups revealed statistically significant intergenerational differences in language use (p = .001). Younger members rarely used Tibetan, while older members used it slightly more.
This paper surveys the landscape of code-switched Arabic natural language processing, covering the mixture of Modern Standard Arabic, dialects, and foreign languages. It examines current efforts, challenges, and research gaps in the field. The survey also provides recommendations for future research directions in code-switched Arabic NLP. Why it matters: Understanding code-switching is crucial for developing effective language technologies that can handle the diverse linguistic landscape of the Arab world.
The paper introduces a two-step approach for transliterating Judeo-Arabic text (written in Hebrew script) into Arabic script. The method applies character-level mapping, then post-correction to fix grammatical and orthographic errors. The authors also benchmark LLMs on the transliteration task and demonstrate that transliteration enables the use of Arabic NLP tools on Judeo-Arabic. Why it matters: This work makes Judeo-Arabic texts more accessible to Arabic NLP, enabling processing and analysis that was previously impossible.
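The two-step idea (character mapping, then post-correction) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the letter table is a simplified conventional Hebrew-to-Arabic correspondence, not the paper's mapping, and the `corrections` lookup is a hypothetical stand-in for the post-correction step, whose actual design the summary does not describe.

```python
# Step 1 table: a simplified, illustrative Hebrew-to-Arabic letter mapping.
# This is an assumption, NOT the mapping used in the paper. Real Judeo-Arabic
# orthography is ambiguous (diacritic dots, context-dependent letters).
HEBREW_TO_ARABIC = {
    "א": "ا", "ב": "ب", "ג": "ج", "ד": "د", "ה": "ه", "ו": "و",
    "ז": "ز", "ח": "ح", "ט": "ط", "י": "ي", "כ": "ك", "ל": "ل",
    "מ": "م", "נ": "ن", "ס": "س", "ע": "ع", "פ": "ف", "צ": "ص",
    "ק": "ق", "ר": "ر", "ש": "ش", "ת": "ت",
    # Hebrew word-final letter forms map to the same Arabic letters.
    "ך": "ك", "ם": "م", "ן": "ن", "ף": "ف", "ץ": "ص",
}
_TABLE = str.maketrans(HEBREW_TO_ARABIC)


def map_characters(text: str) -> str:
    """Step 1: replace each Hebrew letter with its mapped Arabic letter."""
    return text.translate(_TABLE)


def post_correct(text: str, corrections: dict[str, str]) -> str:
    """Step 2 (hypothetical stand-in): fix known word-level errors by lookup."""
    return " ".join(corrections.get(token, token) for token in text.split())


def transliterate(text: str, corrections: dict[str, str]) -> str:
    """Run character mapping followed by post-correction."""
    return post_correct(map_characters(text), corrections)


if __name__ == "__main__":
    # Hypothetical correction: final ه restored to ة (ta marbuta) for one word.
    corrections = {"المدينه": "المدينة"}
    print(transliterate("כתאב", corrections))
    print(transliterate("אלמדינה", corrections))
```

The example shows why a second pass is needed: the character mapping alone renders אלמדינה with a final ه, and only the correction step restores the standard spelling with ة.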
Thamar Solorio from the University of Houston will discuss machine learning approaches for spontaneous human language processing. The talk will cover adapting multilingual transformers to code-switching data and using data augmentation for domain adaptation in sequence labeling tasks. Solorio will also provide an overview of other research projects at the RiTUAL lab, focusing on the scarcity of labeled data. Why it matters: This presentation addresses key challenges in Arabic NLP related to data scarcity, which is a persistent obstacle in developing effective AI applications for the region.
This paper introduces a new task: detecting propaganda techniques in code-switched text. The authors created and released a corpus of 1,030 English-Roman Urdu code-switched texts annotated with 20 propaganda techniques. Experiments show the importance of directly modeling multilinguality and using the right fine-tuning strategy for this task.
The paper introduces Aladdin-FTI, a system designed for generating and translating dialectal Arabic (DA). Aladdin-FTI supports text generation in Moroccan, Egyptian, Palestinian, Syrian, and Saudi dialects. It also handles bidirectional translation between these dialects, Modern Standard Arabic (MSA), and English. Why it matters: This work contributes to addressing the under-representation of Arabic dialects in NLP research and enables more inclusive Arabic language models.