GCC AI Research


Results for "language learning"

Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion

arXiv ·

This paper introduces AraLLaMA, a new Arabic large language model (LLM) trained with a progressive vocabulary expansion method inspired by second language acquisition. The model uses a modified byte-pair encoding (BPE) algorithm to extend the Arabic subwords in its vocabulary dynamically during training, balancing the out-of-vocabulary (OOV) ratio. Experiments show AraLLaMA achieves performance comparable to existing Arabic LLMs on various benchmarks, and all models, data, and code will be open-sourced. Why it matters: This work addresses the need for more accessible and performant Arabic LLMs, contributing to the democratization of AI in the Arab world.
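As a rough illustration of what "balancing the OOV ratio" can mean, the sketch below grows a vocabulary in stages and tracks the share of tokens it still misses. It is a simplified word-level stand-in, not the paper's modified BPE algorithm; the function names, the `step` size, and the number of `stages` are all assumptions made for this example.

```python
from collections import Counter


def oov_ratio(tokens, vocab):
    """Fraction of tokens that are not covered by the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocab)
    return missing / len(tokens)


def expand_vocab(tokens, vocab, step):
    """Return the vocabulary plus the `step` most frequent uncovered tokens."""
    counts = Counter(t for t in tokens if t not in vocab)
    new_entries = [t for t, _ in counts.most_common(step)]
    return set(vocab) | set(new_entries)


def progressive_schedule(tokens, vocab, step, stages):
    """Expand the vocabulary stage by stage, recording the OOV ratio each time.

    Hypothetical schedule: a fixed number of entries is added per stage.
    """
    ratios = [oov_ratio(tokens, vocab)]
    for _ in range(stages):
        vocab = expand_vocab(tokens, vocab, step)
        ratios.append(oov_ratio(tokens, vocab))
    return vocab, ratios


if __name__ == "__main__":
    corpus = ["a", "b", "b", "c", "c", "c", "d"]
    final_vocab, ratios = progressive_schedule(corpus, {"a"}, step=1, stages=2)
    print(sorted(final_vocab), ratios)
```

Each stage adds the most frequent uncovered items first, so the OOV ratio falls gradually instead of dropping to zero in one step.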

Processing language like a human

MBZUAI ·

MBZUAI's Hanan Al Darmaki is working to improve automated speech recognition (ASR) for low-resource languages, where labeled data is scarce. She notes that Arabic presents unique challenges due to dialectal variations and a lack of written resources corresponding to spoken dialects. Al Darmaki's research focuses on unsupervised speech recognition to address this gap. Why it matters: Overcoming these challenges can improve virtual assistant effectiveness across diverse languages and enable more inclusive AI applications in the Arabic-speaking world.

A Panoramic Survey of Natural Language Processing in the Arab World

arXiv ·

This survey paper reviews the landscape of Natural Language Processing (NLP) research and applications in the Arab world. It discusses the unique challenges posed by the Arabic language, such as its morphological complexity and dialectal diversity. The paper also presents a historical overview of Arabic NLP and surveys various research areas, including machine translation, sentiment analysis, and speech recognition. Why it matters: The survey provides a comprehensive resource for researchers and practitioners interested in the current state and future directions of Arabic NLP, a field critical for enabling AI technologies to serve Arabic-speaking communities.

LLMs tackle math word problems

MBZUAI ·

MBZUAI researchers presented a study at NAACL 2024 analyzing errors made by open-source LLMs when solving math word problems. The study, led by Ekaterina Kochmar and KV Aditya Srivatsa, investigates characteristics that make math word problems difficult for machines. Llama2-70B was used to test the ability of LLMs to solve these problems, revealing that LLMs can perform math operations correctly but still give the wrong answer. Why it matters: The research aims to improve AI's ability to understand and solve math word problems, potentially leading to better educational applications and teaching methods.

Machine learning and natural language processing in support of interactive automated tutoring for non-native

MBZUAI ·

Ted Briscoe from the University of Cambridge discussed using machine learning and NLP to develop learning-oriented assessment (LOA) for non-native writers. The technology is used in Cambridge English courseware like Empower and Linguaskill, as well as Write and Improve. Briscoe is also the co-founder and CEO of iLexIR Ltd. Why it matters: Improving automated language assessment could significantly enhance online language learning platforms in the Arab world and beyond.

Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

arXiv ·

The paper proposes a method for reducing the verbosity of LLMs in step-by-step reasoning: retain moderately easy problems during Reinforcement Learning with Verifiable Rewards (RLVR) training. These problems act as an implicit length regularizer, preventing the model from excessively increasing output length on harder problems. Experiments with Qwen3-4B-Thinking-2507 show the model matches baseline accuracy with solutions nearly half as long.
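The data-selection idea can be pictured as a filter over per-problem pass rates. The sketch below is only an assumption about how such a filter might look: the thresholds `easy_min` and `easy_max`, the function name, and the pass-rate dictionary are hypothetical and are not taken from the paper.

```python
def select_training_problems(pass_rates, easy_min=0.8, easy_max=1.0, keep_easy=True):
    """Pick problem ids for RLVR training from estimated pass rates.

    pass_rates: dict mapping problem id -> fraction of sampled solutions
                that the verifier accepted (between 0.0 and 1.0).
    Problems below `easy_min` are always kept.
    Problems in [easy_min, easy_max) count as "moderately easy" (hypothetical
    band) and are kept only when `keep_easy` is True.
    Problems at or above `easy_max` are always dropped.
    """
    selected = []
    for problem_id, rate in pass_rates.items():
        if rate < easy_min:
            selected.append(problem_id)
        elif rate < easy_max and keep_easy:
            selected.append(problem_id)
    return sorted(selected)


if __name__ == "__main__":
    rates = {"p1": 0.2, "p2": 0.85, "p3": 1.0, "p4": 0.5}
    print(select_training_problems(rates, keep_easy=False))
    print(select_training_problems(rates, keep_easy=True))
```

With `keep_easy=False` the filter mimics a setup that trains only on harder problems; switching it on adds the moderately easy band back into the training pool, which is the part the summary credits with keeping solutions short.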

Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus

arXiv ·

This paper introduces a large-scale historical corpus of written Arabic spanning 1400 years. The corpus was cleaned and processed using Arabic NLP tools, including identification of reused text. The study uses a novel automatic periodization algorithm to study the history of the Arabic language, confirming the division into Modern Standard and Classical Arabic. Why it matters: This resource enables further computational research into the evolution of Arabic and the development of NLP tools for historical texts.