This paper introduces a large-scale historical corpus of written Arabic spanning 1,400 years. The corpus was cleaned and processed using Arabic NLP tools, including identification of reused text. The authors apply a novel automatic periodization algorithm to trace the history of the Arabic language, and the results confirm the division into Modern Standard and Classical Arabic. Why it matters: This resource enables further computational research into the evolution of Arabic and the development of NLP tools for historical texts.
Ekaterina Vylomova from the University of Melbourne gave a talk on using NLP models to advance research in linguistic morphology, typology, and social psychology. The talk covered using models to study morphology, phonetic changes in words over time, and diachronic changes in language semantics. Vylomova presented the UniMorph project, a cross-lingual annotation schema and database with morphological paradigms for over 150 languages. Why it matters: This research demonstrates the potential of NLP to contribute to a deeper understanding of language evolution and structure, with applications in linguistic research and the study of social and cultural changes.
The InterText project, funded by the European Research Council, aims to advance NLP by developing a framework for modeling fine-grained relationships between texts. This approach enables tracing the origin and evolution of texts and ideas. Iryna Gurevych from the Technical University of Darmstadt presented the intertextual approach to NLP, covering data modeling, representation learning, and practical applications. Why it matters: This research could enable a new generation of AI applications for text work and critical reading, with potential applications in collaborative knowledge construction and document revision assistance.
A talk will present two projects on using NLP to estimate a client's depression severity and well-being. The first project examines emotional coherence, the agreement between the subjective experience of emotions and their expression in therapy, using transformer-based emotion recognition models. The second project proposes a semantic pipeline that estimates depression severity from individuals' social media posts, exploring different aggregation methods to select one of the four Beck Depression Inventory (BDI) response options for each symptom. Why it matters: This research explores how NLP techniques can be applied to mental health assessment, potentially offering new tools for diagnosis and treatment monitoring.
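To make the aggregation step in the second project concrete, here is a minimal sketch, not the pipeline described in the talk. It assumes that each post has already been scored against the four BDI options for a single symptom (the scores and function names below are hypothetical), and it compares three simple ways of combining per-post scores into one answer.

```python
from collections import Counter
from statistics import mean

# Assumption: each inner list holds one post's scores for BDI options 0-3
# of a single symptom. The scoring model itself is out of scope here.

def aggregate_mean(post_scores):
    """Average each option's score across posts, then pick the top option."""
    n_options = len(post_scores[0])
    averaged = [mean(post[i] for post in post_scores) for i in range(n_options)]
    return max(range(n_options), key=lambda i: averaged[i])

def aggregate_max(post_scores):
    """Pick the option that reaches the highest score in any single post."""
    n_options = len(post_scores[0])
    peaks = [max(post[i] for post in post_scores) for i in range(n_options)]
    return max(range(n_options), key=lambda i: peaks[i])

def aggregate_vote(post_scores):
    """Each post votes for its top option; ties go to the lower option index."""
    votes = Counter(
        max(range(len(post)), key=lambda i: post[i]) for post in post_scores
    )
    top = max(votes.values())
    return min(option for option, count in votes.items() if count == top)

# Hypothetical scores for three posts by the same user.
scores = [
    [0.70, 0.20, 0.05, 0.05],
    [0.60, 0.30, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.80],
]

print(aggregate_mean(scores), aggregate_max(scores), aggregate_vote(scores))
```

On this toy input, mean and vote aggregation both select option 0, while max aggregation selects option 3 because a single post scores it strongly. This illustrates why the choice of aggregation method can change the predicted severity.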
The paper introduces the concept of Arabic Level of Dialectness (ALDi), a continuous variable representing the degree of dialectal Arabic in a sentence, arguing that Arabic exists on a spectrum between Modern Standard Arabic (MSA) and Dialectal Arabic (DA). The authors present the AOC-ALDi dataset, comprising 127,835 sentences manually labeled for dialectness level, derived from news articles and user comments. Experiments show that a model trained on AOC-ALDi can identify dialectness levels across various corpora and genres. Why it matters: ALDi provides a more nuanced approach to analyzing Arabic text than binary dialect identification, enabling sociolinguistic analysis of stylistic choices.
This article discusses growing concerns about the interpretability of large deep learning models. It highlights a talk by Danish Pruthi, an Assistant Professor at the Indian Institute of Science (IISc), Bangalore, who presented a framework for quantifying the value of explanations and argued for holistic model evaluation. Pruthi's talk also touched on how geographically representative the artifacts generated by text-to-image models are and how well conversational LLMs challenge false assumptions. Why it matters: Addressing interpretability and evaluation is crucial for building trustworthy and reliable AI systems, particularly in sensitive applications within the Middle East and globally.
A study investigated language shift from Tibetan to Arabic among Tibetan families who migrated to Saudi Arabia 70 years ago. Data from 96 participants across three age groups revealed significant intergenerational differences in language use. Younger members rarely used Tibetan, while older members used it slightly more; the difference was statistically significant (p = .001).
This survey paper reviews the landscape of Natural Language Processing (NLP) research and applications in the Arab world. It discusses the unique challenges posed by the Arabic language, such as its morphological complexity and dialectal diversity. The paper also presents a historical overview of Arabic NLP and surveys various research areas, including machine translation, sentiment analysis, and speech recognition. Why it matters: The survey provides a comprehensive resource for researchers and practitioners interested in the current state and future directions of Arabic NLP, a field critical for enabling AI technologies to serve Arabic-speaking communities.