Results for "low-resource"

Challenges in low-resourced NLP: an Irish case study

MBZUAI

Dr. Teresa Lynn from Dublin City University (DCU) discussed the challenges in developing NLP tools for Irish, a low-resource language facing digital extinction. She highlighted the lack of speech and language applications and fundamental language resources for Irish. Lynn also mentioned her work at DCU on the GaelTech project and her involvement in the European Language Equality project. Why it matters: The development of NLP tools for low-resource languages like Irish is crucial for preserving linguistic diversity and preventing digital marginalization in the AI era.

Addressing NLP problems in low resource settings

MBZUAI

Thamar Solorio from the University of Houston will discuss machine learning approaches for spontaneous human language processing. The talk will cover adapting multilingual transformers to code-switching data and using data augmentation for domain adaptation in sequence labeling tasks. Solorio will also provide an overview of other research projects at the RiTUAL lab, focusing on the scarcity of labeled data. Why it matters: Scarcity of labeled data is a persistent obstacle to building effective NLP applications, including for Arabic, and the techniques covered in this talk target that problem directly.

Processing language like a human

MBZUAI

MBZUAI's Hanan Al Darmaki is working to improve automatic speech recognition (ASR) for low-resource languages, where labeled data is scarce. She notes that Arabic presents unique challenges due to dialectal variation and a lack of written resources corresponding to spoken dialects. Al Darmaki's research focuses on unsupervised speech recognition to address this gap. Why it matters: Overcoming these challenges can make virtual assistants more effective across diverse languages and enable more inclusive AI applications in the Arabic-speaking world.

Resource-Aware Arabic LLM Creation: Model Adaptation, Integration, and Multi-Domain Testing

arXiv · Dec 23

Researchers fine-tuned the Qwen2-1.5B model for Arabic with QLoRA on a system with 4GB of VRAM, training on datasets such as Bactrian and Arabic Wikipedia. They addressed challenges in Arabic NLP including morphology and dialectal variation. After 10,000 training steps, the final loss converged to 0.1083, with improved handling of Arabic-specific linguistic phenomena. Why it matters: This demonstrates a resource-efficient approach to building specialized Arabic language models on consumer-grade hardware, widening access to advanced NLP technologies.

RightNow-Arabic-0.5B-Turbo: An Open Sub-1B Arabic Language Model via Vocabulary Injection and Edge-First Deployment

arXiv · Apr 10

RightNow-Arabic-0.5B-Turbo is a new 518M-parameter Arabic-specialized decoder LLM, built on Qwen2.5-0.5B, designed to bridge the gap between small multilingual models and large Arabic-specialized models. Its development pipeline included adding 27,032 Arabic tokens via vocabulary injection, continued pretraining on 504M Arabic tokens, and fine-tuning with supervised instruction tuning and direct preference optimization. The model achieved 35.9% mean accuracy on three Arabic benchmarks (COPA-ar, Arabic HellaSwag, ArabicMMLU), outperforming all same-class open models and recovering 67% of SILMA-9B's mean accuracy at 1/18 the parameters. All code and weights are publicly released. Why it matters: The model provides a specialized sub-1B Arabic LLM suitable for edge deployment, making Arabic language AI more accessible on resource-constrained devices.

Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)

arXiv · Dec 20

The first Workshop on Language Models for Low-Resource Languages (LoResLM 2025) was held in Abu Dhabi as part of COLING 2025. It provided a forum for researchers to share work on language models for low-resource languages. The workshop accepted 35 papers from 52 submissions, covering diverse languages and research areas.