Skip to content
GCC AI Research

Search

Results for "Machine Translation"

Egyptian Arabic to English Statistical Machine Translation System for NIST OpenMT'2015

arXiv ·

This paper describes the QCRI-Columbia-NYUAD group's Egyptian Arabic-to-English statistical machine translation system submitted to the NIST OpenMT'2015 competition. The system used tools like 3arrib and MADAMIRA for processing and standardizing informal dialectal Arabic. The system was trained using phrase-based SMT with features such as operation sequence model, class-based language model and neural network joint model. Why it matters: The work demonstrates advances in machine translation for dialectal Arabic, a challenging but important area for regional communication and NLP research.

QCRI Machine Translation Systems for IWSLT 16

arXiv ·

This paper describes QCRI's machine translation systems for the IWSLT 2016 evaluation campaign, focusing on Arabic-English and English-Arabic tracks. They built both Phrase-based and Neural machine translation models. A Neural MT system, trained by stacking data from different genres through fine-tuning, and applying ensemble over 8 models, outperformed a strong phrase-based system by 2 BLEU points in the Arabic->English direction. Why it matters: The research highlights the early promise of neural machine translation for Arabic language pairs, demonstrating its potential to surpass traditional methods.

Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation

arXiv ·

This paper explores Dialectal Arabic (DA) to Modern Standard Arabic (MSA) machine translation using prompting and fine-tuning techniques for Levantine, Egyptian, and Gulf dialects. The study found that few-shot prompting outperformed zero-shot and chain-of-thought methods across six large language models, with GPT-4o achieving the highest performance. A quantized Gemma2-9B model achieved a chrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58). Why it matters: The research provides a resource-efficient pipeline for DA-MSA translation, enabling more inclusive language technologies by addressing the challenges posed by dialectal variations in Arabic.

Simultaneous Masking, Not Prompting Optimization: A Paradigm Shift in Fine-tuning LLMs for Simultaneous Translation

arXiv ·

This paper introduces SimulMask, a new paradigm for fine-tuning large language models (LLMs) for simultaneous translation. SimulMask utilizes a novel attention masking approach that models simultaneous translation during fine-tuning by masking attention for a desired decision policy. Applied to a Falcon LLM on the IWSLT 2017 dataset, SimulMask achieves improved translation quality compared to state-of-the-art prompting optimization strategies across five language pairs while reducing computational cost. Why it matters: The proposed method offers a more efficient way to adapt LLMs for real-time translation, potentially enhancing multilingual communication tools and services.

NADI 2024: The Fifth Nuanced Arabic Dialect Identification Shared Task

arXiv ·

The fifth Nuanced Arabic Dialect Identification (NADI) 2024 shared task aimed to advance Arabic NLP through dialect identification and dialect-to-MSA machine translation. 51 teams registered, with 12 participating and submitting 76 valid submissions across three subtasks. The winning teams achieved 50.57 F1 for multi-label dialect identification, 0.1403 RMSE for dialectness level identification, and 20.44 BLEU for dialect-to-MSA translation. Why it matters: The results highlight the continued challenges in Arabic dialect processing and provide a benchmark for future research in this area.

Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging

arXiv ·

This paper explores language-independent alternatives to morphological segmentation for Arabic NLP using data-driven sub-word units, characters as a unit of learning, and word embeddings learned using a character CNN. The study evaluates these methods on machine translation and POS tagging tasks. Results show these methods achieve performance close to or surpassing state-of-the-art approaches. Why it matters: By offering simpler, more adaptable segmentation techniques, this research can help improve Arabic NLP applications across diverse domains and dialects.

A Panoramic Survey of Natural Language Processing in the Arab World

arXiv ·

This survey paper reviews the landscape of Natural Language Processing (NLP) research and applications in the Arab world. It discusses the unique challenges posed by the Arabic language, such as its morphological complexity and dialectal diversity. The paper also presents a historical overview of Arabic NLP and surveys various research areas, including machine translation, sentiment analysis, and speech recognition. Why it matters: The survey provides a comprehensive resource for researchers and practitioners interested in the current state and future directions of Arabic NLP, a field critical for enabling AI technologies to serve Arabic-speaking communities.