NLP

Natural language processing research from GCC institutions, covering Arabic NLP, multilingual models, text classification, named entity recognition, and machine translation.

The Cylindrical Representation Hypothesis for Language Model Steering

arXiv · May 3 · LLM NLP

Researchers from MBZUAI have proposed the Cylindrical Representation Hypothesis (CRH) to explain the instability and unpredictability observed in large language model steering. CRH relaxes the orthogonality assumption of the existing Linear Representation Hypothesis, positing a cylindrical structure where a central axis captures concept differences and a surrounding normal plane controls steering sensitivity. The hypothesis suggests that the intrinsic uncertainty in identifying specific sensitive sectors within this normal plane accounts for why steering outcomes frequently fluctuate even with well-aligned directions. Why it matters: This research offers a more robust theoretical framework for understanding and potentially improving the control and reliability of large language models.

Instruction-Guided Poetry Generation in Arabic and Its Dialects

arXiv · Apr 30 · NLP LLM

Researchers at MBZUAI have developed a new method for controllable poetry generation in Arabic and its dialects, moving beyond traditional analysis tasks for Arabic poetry within Large Language Models (LLMs). They introduce a large-scale, instruction-based dataset in Modern Standard Arabic (MSA) and various Arabic dialects, enabling LLMs to perform tasks like writing, revising, and continuing poems based on user criteria. Experiments show that fine-tuning LLMs on this dataset results in models capable of generating poetry aligned with user requirements, validated by automated metrics and human evaluation. Why it matters: This work represents a significant advancement in Arabic Natural Language Processing, offering tools for creative expression and cultural preservation while opening new avenues for user-guided content generation in culturally rich text forms.

New Google AI feature lets your data power smarter answers — now in UAE - Gulf News

Gulf News News · Apr 15 · Product LLM

Google has introduced a new AI feature in the United Arab Emirates, designed to provide more intelligent and personalized answers to users. This feature reportedly leverages user data, with consent, to enhance its responsiveness and relevance. The rollout in the UAE signifies the expansion of Google's advanced AI services into the Middle East market. Why it matters: This launch represents increased access to sophisticated AI tools for consumers and businesses in the UAE, potentially accelerating AI adoption and innovation in the local digital economy.

Severity-Aware Weighted Loss for Arabic Medical Text Generation

arXiv · Apr 7 · NLP LLM

Researchers proposed a severity-aware weighted loss method to fine-tune Arabic language models for medical text generation, prioritizing severe clinical cases. This approach utilizes soft severity probabilities, derived from an AraBERT-based classifier, to dynamically scale token-level loss contributions during optimization on the MAQA dataset. The method consistently improved performance across ten Arabic LLMs, with AraGPT2-Base increasing from 54.04% to 66.14% and AraGPT2-Medium from 59.16% to 67.18%. Why it matters: This novel fine-tuning strategy addresses a critical limitation in medical AI by enhancing the safety and reliability of Arabic medical large language models, particularly in high-stakes clinical scenarios.
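
The idea of scaling loss by soft severity probabilities can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, the three severity classes, and the per-class weights are all assumptions, and the scaling is applied uniformly to every token in a sample.

```python
import math

def severity_weighted_loss(token_log_probs, severity_probs,
                           severity_weights=(1.0, 1.5, 2.0)):
    """Mean token negative log-likelihood, scaled by an expected severity weight.

    token_log_probs: log-probabilities the model assigned to the reference tokens.
    severity_probs: soft class probabilities for the sample (e.g. mild, moderate,
        severe), as a classifier might output them.
    severity_weights: hypothetical per-class weights, not taken from the paper.
    """
    if not math.isclose(sum(severity_probs), 1.0, abs_tol=1e-6):
        raise ValueError("severity_probs must sum to 1")
    # Expected weight under the soft severity distribution.
    scale = sum(p * w for p, w in zip(severity_probs, severity_weights))
    # Per-token negative log-likelihood.
    nll = [-lp for lp in token_log_probs]
    return scale * sum(nll) / len(nll)
```

Under these assumed weights, a sample the classifier considers certainly severe contributes twice the loss of a certainly mild one, so gradient updates lean toward the severe cases.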

State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation

arXiv · Apr 7 · NLP LLM

Arabic-DeepSeek-R1 is an application-driven, open-source Arabic Large Language Model (LLM) that has achieved a new state of the art (SOTA) on the Open Arabic LLM Leaderboard (OALL). The model utilizes a sparse Mixture-of-Experts (MoE) backbone and a four-phase Chain-of-Thought (CoT) distillation scheme, which incorporates Arabic-specific linguistic verification and regional ethical norms. It records the highest average score on the OALL suite and outperforms proprietary frontier systems like GPT-5.1 on a majority of benchmarks evaluating comprehensive Arabic language-specific tasks. Why it matters: This work offers a validated and cost-effective framework for developing high-performing, culturally grounded AI for under-represented languages, addressing the digital equity gap.

Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation

arXiv · Apr 6 · NLP Research

Researchers have developed OmniScore, a family of deterministic learned metrics designed to evaluate generative text as an alternative to Large Language Models (LLMs) used as judges. OmniScore leverages small parameter models (<1B) and was trained on approximately 564,000 synthetic instances across 107 languages, then evaluated using 8,617 manually annotated instances. It approximates LLM-judge behavior while offering low latency and consistency for various evaluation settings like reference-based and source-grounded assessments in tasks like QA, translation, and summarization. Why it matters: This development provides a practical, scalable, and reproducible method for multilingual generative text evaluation, addressing key limitations of LLM-as-a-judge approaches and offering significant benefits for AI development in linguistically diverse regions.

Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation

arXiv · Apr 3 · LLM NLP

QIMMA is introduced as a quality-assured Arabic LLM leaderboard that places systematic benchmark validation at its core. It employs a multi-model assessment pipeline combining automated LLM judgment with human review to identify and resolve quality issues in established Arabic benchmarks. The resulting evaluation suite comprises over 52,000 samples, predominantly grounded in native Arabic content, with transparent implementation via LightEval and EvalPlus. Why it matters: This initiative provides a more reliable and reproducible foundation for evaluating Arabic Large Language Models, addressing critical quality concerns in existing benchmarks.

Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

arXiv · Mar 25 · NLP LLM

Researchers developed a retrieval-augmented generation (RAG) framework to improve Arabic Large Language Models (LLMs) in understanding complex historical and religious texts like the Quran and Hadith. This framework grounds LLMs in the Doha Historical Dictionary of Arabic (DHDA) through hybrid retrieval and intent-based routing. The approach significantly boosted the accuracy of Arabic-native LLMs such as Fanar and ALLaM to over 85%, closing the performance gap with proprietary models like Gemini. Why it matters: This research offers a novel method for enhancing Arabic NLP capabilities for historically nuanced texts, demonstrating the value of integrating diachronic lexicographic resources into RAG systems for deeper language understanding.

TII Launches Falcon Perception, A New Multimodal AI Model That Helps Machines See and Understand the World – with Efficiency that Rivals Larger Models

TII · Mar 17 · CV NLP

The Technology Innovation Institute (TII) has launched Falcon Perception, a new 600-million-parameter multimodal AI model. This model offers competitive performance in object segmentation, dense visual understanding, and document intelligence, rivalling larger systems like Meta’s SAM3 and Alibaba’s Qwen with significantly greater efficiency. Falcon Perception unifies image and language processing in a single architecture, designed for real-world deployment in compute-constrained environments. Why it matters: This development positions the UAE among leading nations in advanced multimodal AI, which is crucial for applications in robotics, advanced manufacturing, and autonomous platforms.

Introducing the Open Arabic LLM Leaderboard: Empowering the Arabic Language Modeling Community

TII · Mar 17 · NLP LLM

The Open Arabic LLM Leaderboard (OALL) has been launched to benchmark Arabic language models, addressing the gap in resources for non-English NLP. It incorporates datasets like AlGhafa, ACVA, and translated versions of MMLU and EXAMS from the AceGPT suite. The leaderboard uses normalized log likelihood accuracy for tasks, built around HuggingFace’s LightEval framework. Why it matters: This initiative promotes research and development in Arabic NLP, serving over 380 million Arabic speakers by enhancing the evaluation and improvement of Arabic LLMs.
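
For readers unfamiliar with the metric, normalized log-likelihood accuracy on a multiple-choice task can be sketched as below. The sketch assumes normalization by the character length of each answer choice, which is one common convention; whether OALL uses exactly this normalization is not stated here, and the function names and item layout are illustrative.

```python
def pick_choice(choices, log_likelihoods):
    """Return the index of the choice with the highest length-normalized score."""
    # Dividing by length keeps long answers from being penalized just for
    # containing more tokens, each of which lowers the total log-likelihood.
    scores = [ll / max(len(c), 1) for c, ll in zip(choices, log_likelihoods)]
    return max(range(len(choices)), key=scores.__getitem__)

def normalized_accuracy(items):
    """Fraction of items where the normalized pick matches the gold index.

    Each item is a dict with 'choices', 'log_likelihoods', and 'gold' keys
    (a hypothetical layout chosen for this sketch).
    """
    correct = sum(
        pick_choice(it["choices"], it["log_likelihoods"]) == it["gold"]
        for it in items
    )
    return correct / len(items)
```

Without normalization, the raw log-likelihood would tend to favor the shortest choice regardless of content.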

Technology Innovation Institute Announces Launch of NOOR, the World’s Largest Arabic NLP Model

TII · Mar 17 · NLP LLM

Technology Innovation Institute (TII) in Abu Dhabi, in collaboration with LightOn, has launched NOOR, a 10 billion parameter Arabic natural language processing (NLP) model. The model was trained on a large, high-quality cross-domain Arabic dataset including web data, books, poetry, news, and technical information. It enables applications in automated summarization, chatbots, and personalized marketing. Why it matters: NOOR represents a significant advancement in Arabic NLP, potentially enabling more sophisticated AI applications tailored to the Arabic language and regional needs.

Abu Dhabi’s TII Launches Falcon-H1 Arabic, Establishing the World’s Leading Arabic AI Model

TII · Mar 17 · NLP LLM

Abu Dhabi’s Technology Innovation Institute (TII) has launched Falcon-H1 Arabic, a new large language model based on a hybrid Mamba-Transformer architecture. The Falcon-H1 family comes in 3B, 7B, and 34B parameter sizes and outperforms existing models on the Open Arabic LLM Leaderboard (OALL). The model features improvements in data quality, dialect coverage, and long-context stability. Why it matters: This release strengthens the UAE's position in Arabic AI and provides a high-performing model tailored to the linguistic and cultural needs of the region.

SectEval: Evaluating the Latent Sectarian Preferences of Large Language Models

arXiv · Mar 13 · NLP LLM

The paper introduces SectEval, a new benchmark to evaluate sectarian biases in LLMs concerning Sunni and Shia Islam, available in English and Hindi. Results show significant inconsistencies in LLM responses based on language, with some models favoring Shia responses in English but Sunni in Hindi. Location-based experiments further reveal that advanced models adapt their responses based on the user's claimed country, while smaller models exhibit a consistent Sunni-leaning bias.

Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation

arXiv · Mar 2 · NLP LLM

MBZUAI researchers have developed an automatic interview system that uses LLMs to elicit nuanced, role-specific information from job candidates, improving early-stage hiring decisions. The system updates its belief about an applicant's rubric-oriented latent traits in a calibrated way based on their interview performance. Evaluation on simulated interviews showed the system's belief converges towards the simulated applicants' constructed ability levels.

ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models

arXiv · Feb 21 · NLP LLM

The paper introduces ArabicNumBench, a benchmark for evaluating LLMs on Arabic number reading using both Eastern and Western Arabic numerals. It evaluates 71 models from 10 providers on 210 number reading tasks, using zero-shot, zero-shot CoT, few-shot, and few-shot CoT prompting strategies. The results show substantial performance variation, with few-shot CoT prompting achieving 2.8x higher accuracy than zero-shot approaches. Why it matters: The benchmark establishes baselines for Arabic number comprehension and provides guidance for model selection in production Arabic NLP systems.
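
As background on the two numeral systems, Eastern Arabic digits (٠ to ٩) and Western Arabic digits (0 to 9) map one to one, so converting between them is a character substitution. The helper names below are illustrative and are not part of the benchmark.

```python
EASTERN_DIGITS = "٠١٢٣٤٥٦٧٨٩"  # U+0660 to U+0669
WESTERN_DIGITS = "0123456789"

# Translation tables for both directions.
TO_WESTERN = str.maketrans(EASTERN_DIGITS, WESTERN_DIGITS)
TO_EASTERN = str.maketrans(WESTERN_DIGITS, EASTERN_DIGITS)

def to_western(text: str) -> str:
    """Replace Eastern Arabic digits with Western ones; other characters pass through."""
    return text.translate(TO_WESTERN)

def to_eastern(text: str) -> str:
    """Replace Western Arabic digits with Eastern ones; other characters pass through."""
    return text.translate(TO_EASTERN)
```

The substitution itself is trivial for code; the benchmark instead measures whether models read such numbers correctly in context.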

ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning

arXiv · Feb 19 · NLP Arabic AI

The paper introduces ALPS (Arabic Linguistic & Pragmatic Suite), a diagnostic challenge set for evaluating deep semantics and pragmatics in Arabic NLP. The dataset contains 531 expert-curated questions across 15 tasks and 47 subtasks, designed to test morpho-syntactic dependencies and compositional semantics. Evaluation of 23 models, including commercial, open-source, and Arabic-native models, reveals that models struggle with fundamental morpho-syntactic dependencies, especially those reliant on diacritics. Why it matters: ALPS provides a valuable benchmark for evaluating the linguistic competence of Arabic NLP models, highlighting areas where current models fall short despite achieving high fluency.

From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models

arXiv · Feb 10 · NLP LLM

Arabic Language Models (LMs) are primarily pretrained on Modern Standard Arabic (MSA), with an expectation of transferring to diverse Arabic dialects for real-world applications. This work explores cross-lingual transfer in Arabic LMs using probing on three Natural Language Processing (NLP) tasks and representational similarity. The findings indicate that transfer is possible but disproportionate across dialects, with some evidence of negative interference in models trained to support all Arabic dialects. Why it matters: This research highlights crucial challenges for building robust Arabic AI systems that effectively handle the significant linguistic diversity of the Arab world.

SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models

arXiv · Feb 3 · NLP LLM

The paper introduces SalamahBench, a new benchmark for evaluating the safety of Arabic Language Models (ALMs). The benchmark comprises 8,170 prompts across 12 categories aligned with the MLCommons Safety Hazard Taxonomy. Five state-of-the-art ALMs, including Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2, were evaluated using the benchmark. Why it matters: The benchmark enables standardized, category-aware safety evaluation, highlighting the necessity of specialized safeguard mechanisms for robust harm mitigation in ALMs.

DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding

arXiv · Jan 27 · CV NLP

MBZUAI researchers introduce DuwatBench, a new benchmark for multimodal understanding of Arabic calligraphy. The dataset contains 1,272 samples across six calligraphic styles with detailed annotations to evaluate visual-text alignment. Evaluation of 13 multimodal models reveals challenges in processing calligraphic variations and artistic distortions, highlighting the need for culturally grounded AI research.

Generative AI in Saudi Arabia: A National Survey of Adoption, Risks, and Public Perceptions

arXiv · Jan 26 · Research Policy

A national survey in Saudi Arabia of 330 participants reveals that 93% are actively using Generative AI, primarily for text-based tasks, while awareness and understanding remain uneven. Participants recognize benefits like productivity but caution against risks such as privacy, misinformation, and ethical misuse. The study highlights the need for AI literacy, culturally aligned solutions, and stronger frameworks for responsible deployment in Saudi Arabia.

YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

arXiv · Jan 13 · LLM RL

The paper introduces Yet another Policy Optimization (YaPO), a reference-free method for learning sparse steering vectors in the latent space of a Sparse Autoencoder (SAE) to steer LLMs. By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Experiments show YaPO converges faster, achieves stronger performance, exhibits improved training stability and preserves general knowledge compared to dense steering baselines.

Community-Based Early-Stage Chronic Kidney Disease Screening using Explainable Machine Learning for Low-Resource Settings

arXiv · Jan 3 · Research Healthcare

This paper introduces an explainable machine learning framework for early-stage chronic kidney disease (CKD) screening, specifically designed for low-resource settings in Bangladesh and South Asia. The framework utilizes a community-based dataset from Bangladesh and evaluates multiple ML classifiers with feature selection techniques. Results show that the ML models achieve high accuracy and sensitivity, outperforming existing screening tools and demonstrating strong generalizability across independent datasets from India, the UAE, and Bangladesh.

Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation

arXiv · Dec 25 · NLP Arabic AI

The paper introduces Ara-HOPE, a human-centric post-editing evaluation framework for Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation. Ara-HOPE includes a five-category error taxonomy and a decision-tree annotation protocol designed to address the challenges of dialect-specific MT errors. Evaluation of Jais, GPT-3.5, and NLLB-200 shows dialect-specific terminology and semantic preservation remain key challenges. Why it matters: The new framework and public dataset will help improve the evaluation and development of dialect-aware MT systems for Arabic.

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

arXiv · Dec 22 · CV Research

The paper introduces the Prism Hypothesis, which posits a correspondence between an encoder's feature spectrum and its functional role, with semantic encoders capturing low-frequency components and pixel encoders retaining high-frequency information. Based on this, the authors propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details using a frequency-band modulator. Experiments on ImageNet and MS-COCO demonstrate that UAE effectively unifies semantic abstraction and pixel-level fidelity, achieving state-of-the-art performance.

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv · Dec 20 · NLP Arabic AI

The paper introduces AraToken, an Arabic-optimized tokenizer based on the SentencePiece Unigram algorithm that incorporates a normalization pipeline to handle Arabic-specific orthographic variations. Experiments show that AraToken achieves 18% lower fertility compared to unnormalized baselines. The Language Extension Pipeline (LEP) is introduced to integrate AraToken into Qwen3-0.6B, reducing evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: This research provides an efficient tokenizer tailored for Arabic, improving performance of LLMs on Arabic text and benefiting Arabic NLP research by providing released resources.
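
Fertility is usually defined as the average number of tokens a tokenizer produces per word, so lower values mean shorter sequences for the same text. A minimal sketch follows, assuming whitespace-delimited words; the toy tokenizer is a stand-in for illustration and has nothing to do with AraToken itself.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def two_char_tokenize(text):
    """Toy tokenizer (hypothetical): split each word into chunks of up to two characters."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return tokens
```

Comparing `fertility(corpus, tokenizer_a)` with `fertility(corpus, tokenizer_b)` on the same corpus gives the kind of relative reduction the summary reports.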

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv · Dec 18 · CV NLP

A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length and multimodal richness, using human-validated samples. LongShOTAgent, an agentic system, is also presented for analyzing long videos, with both the benchmark and agent demonstrating the challenges faced by state-of-the-art MLLMs.

From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region

arXiv · Dec 16 · CV Research

A new study compares vision-language models (VLMs) to YOLOv8 for wastewater treatment plant (WWTP) identification in satellite imagery across the MENA region. VLMs like Gemma-3 demonstrate superior zero-shot performance compared to a YOLOv8 model trained on a dataset of 83,566 satellite images from Egypt, Saudi Arabia, and the UAE. The research suggests VLMs offer a scalable, annotation-free alternative for remote sensing of WWTPs.

FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models

arXiv · Nov 24 · NLP LLM

The paper introduces FanarGuard, a bilingual moderation filter for Arabic and English language models that considers both safety and cultural alignment. A dataset of 468K prompt-response pairs was created and scored by LLM judges on harmlessness and cultural awareness to train the filter. The first benchmark targeting Arabic cultural contexts was developed to evaluate cultural alignment. Why it matters: FanarGuard advances context-sensitive AI safeguards by integrating cultural awareness into content moderation, addressing a critical gap in current alignment techniques.

Cross-Document Topic-Aligned Chunking for Retrieval-Augmented Generation

arXiv · Nov 8 · NLP LLM

This paper introduces Cross-Document Topic-Aligned (CDTA) chunking to address knowledge fragmentation in Retrieval-Augmented Generation (RAG) systems. CDTA identifies topics across documents, maps segments to topics, and synthesizes them into unified chunks. Experiments on HotpotQA and UAE legal texts show that CDTA improves faithfulness and citation accuracy compared to existing chunking methods, especially for complex queries requiring multi-hop reasoning.

Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

arXiv · Nov 2 · NLP LLM

A new method is proposed to reduce the verbosity of LLMs in step-by-step reasoning by retaining moderately easy problems during Reinforcement Learning with Verifiable Rewards (RLVR) training. This approach acts as an implicit length regularizer, preventing the model from excessively increasing output length on harder problems. Experiments using Qwen3-4B-Thinking-2507 show the model achieves baseline accuracy with solutions nearly half as long.

LLM-based Multi-class Attack Analysis and Mitigation Framework in IoT/IIoT Networks

arXiv · Oct 30 · Research NLP

This paper introduces a framework that combines machine learning for multi-class attack detection in IoT/IIoT networks with large language models (LLMs) for attack behavior analysis and mitigation suggestion. The framework uses role-play prompt engineering with RAG to guide LLMs like ChatGPT-o3 and DeepSeek-R1, and introduces new evaluation metrics for quantitative assessment. Experiments using Edge-IIoTset and CICIoT2023 datasets showed Random Forest as the best detection model and ChatGPT-o3 outperforming DeepSeek-R1 in attack analysis and mitigation.

Mubeen AI: A Specialized Arabic Language Model for Heritage Preservation and User Intent Understanding

arXiv · Oct 27 · NLP LLM

MASARAT SA has developed Mubeen, a proprietary Arabic language model specializing in Arabic linguistics, Islamic studies, and cultural heritage. Mubeen was trained using native Arabic sources, including digitized historical manuscripts processed via a proprietary Arabic OCR engine. The model employs a Practical Closure Architecture to improve user intent understanding and provide decisive guidance. Why it matters: Mubeen addresses the utility gap in current Arabic LLMs by focusing on native Arabic data and cultural authenticity, which is critical for heritage preservation and alignment with Saudi Vision 2030.

Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

arXiv · Oct 15 · NLP LLM

This survey paper analyzes over 40 benchmarks used to evaluate Arabic large language models, categorizing them into Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. It identifies progress in benchmark diversity but also highlights gaps like limited temporal evaluation and cultural misalignment. The paper also examines methods for creating benchmarks, including native collection, translation, and synthetic generation. Why it matters: The survey provides a comprehensive reference for Arabic NLP research and offers recommendations for future benchmark development to better align with cultural contexts.

Developing and Validating the Arabic Version of the Attitudes Toward Large Language Models Scale

arXiv · Oct 14 · NLP LLM

This paper presents the development and validation of an Arabic version of the Attitudes Toward Large Language Models (AT-GLLM and AT-PLLM) scales, adapted from the original English versions. The study involved translating the scales and testing them on a sample of 249 Arabic-speaking adults. The translated scales demonstrated strong psychometric properties, including a two-factor structure, measurement invariance across genders, and good reliability and validity. Why it matters: This provides a culturally relevant tool for assessing attitudes toward LLMs in the Arab world, crucial for localized research and policy-making in the rapidly growing field of Arabic AI.

ALARB: An Arabic Legal Argument Reasoning Benchmark

arXiv · Oct 1 · NLP LLM

Researchers introduce ALARB, a new benchmark for evaluating reasoning in Arabic LLMs using 13K Saudi commercial court cases. The benchmark includes tasks like verdict prediction, reasoning chain completion, and identification of relevant regulations. Instruction-tuning a 12B parameter model on ALARB achieves performance comparable to GPT-4o in verdict prediction and generation.

Gender Stereotypes in Professional Roles Among Saudis: An Analytical Study of AI-Generated Images Using Language Models

arXiv · Sep 25 · Research Ethics

The study analyzes over 1,000 images generated by ImageFX, DALL-E V3, and Grok for 56 Saudi professions, finding significant gender imbalances and cultural inaccuracies. DALL-E V3 exhibited the strongest gender stereotyping, with 96% male depictions, particularly in leadership and technical roles. The research underscores the need for diverse training data and culturally sensitive evaluation to ensure equitable AI outputs that accurately reflect Saudi Arabia's labor market and culture.

Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale

arXiv · Sep 17 · NLP LLM

The Hala technical report introduces a family of Arabic-centric instruction and translation models developed using a translate-and-tune pipeline. A strong Arabic-English teacher model is compressed to FP8 and used to create bilingual supervision data. The LFM2-1.2B model is fine-tuned on this data and used to translate English instruction sets into Arabic, creating a million-scale corpus. Why it matters: The release of models, data, evaluation tools, and recipes will accelerate research and development in Arabic NLP, providing valuable resources for the community.

Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records

arXiv · Sep 12 · NLP LLM

Researchers address the challenge of limited Arabic medical dialogue data by generating 80,000 synthetic question-answer pairs using ChatGPT-4o and Gemini 2.5 Pro, expanding an initial dataset of 20,000 records. They fine-tuned five LLMs, including Mistral-7B and AraGPT2, and evaluated performance using BERTScore and expert review. Results showed that training with ChatGPT-4o-generated data led to higher F1-scores and fewer hallucinations across models. Why it matters: This demonstrates the potential of synthetic data augmentation to improve domain-specific Arabic language models, particularly for low-resource medical NLP applications.

AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs

arXiv · Sep 4 · NLP LLM

The paper introduces AraHalluEval, a new framework for evaluating hallucinations in Arabic and multilingual large language models (LLMs). The framework uses 12 fine-grained hallucination indicators across generative question answering and summarization tasks, evaluating 12 LLMs including Arabic-specific, multilingual, and reasoning-based models. Results show factual hallucinations are more common than faithfulness errors, with the Arabic model Allam showing lower hallucination rates. Why it matters: This work addresses a critical gap in Arabic NLP by providing a comprehensive tool for assessing and mitigating hallucination in LLMs, which is essential for reliable AI applications in the Arabic-speaking world.

SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation

arXiv · Sep 4 · CV NLP

Researchers from MBZUAI have introduced SPECS, a new reference-free evaluation metric for long image captions that modifies CLIP to emphasize specificity. SPECS aims to improve the correlation with human judgment while maintaining computational efficiency compared to LLM-based metrics. The proposed approach is intended for iterative use during image captioning model development, offering a practical alternative to existing methods.

Continuous Saudi Sign Language Recognition: A Vision Transformer Approach

arXiv · Sep 3 · NLP CV

The researchers introduce KAU-CSSL, the first continuous Saudi Sign Language (SSL) dataset focusing on complete sentences. They propose a transformer-based model using ResNet-18 for spatial feature extraction and a Transformer Encoder with Bidirectional LSTM for temporal dependencies. The model achieved 99.02% accuracy in signer-dependent mode and 77.71% in signer-independent mode, advancing communication tools for the SSL community.

Dhati+: Fine-tuned Large Language Models for Arabic Subjectivity Evaluation

arXiv · Aug 27 · NLP Arabic AI

This paper introduces AraDhati+, a new comprehensive dataset for Arabic subjectivity analysis created by combining existing datasets like ASTD, LABR, HARD, and SANAD. The researchers fine-tuned Arabic language models including XLM-RoBERTa, AraBERT, and ArabianGPT on AraDhati+ for subjectivity classification. An ensemble decision approach achieved 97.79% accuracy. Why it matters: The work addresses the under-resourced nature of Arabic NLP by providing a new dataset and demonstrating strong results in subjectivity classification, advancing sentiment analysis capabilities for the Arabic language.

Saudi Arabia launches HUMAIN Chat, first conversational Arabic AI app - Al Arabiya English

SPA News · Aug 26 · Product Arabic AI

Saudi Arabia has launched HUMAIN Chat, a new conversational artificial intelligence application. The platform is being marketed as the first conversational Arabic AI app developed in the region. This launch signifies a key step in the Kingdom's efforts to advance its digital capabilities and AI offerings. Why it matters: The introduction of a dedicated Arabic conversational AI platform can significantly enhance digital interaction for Arabic speakers and accelerate the development of localized AI solutions in the Middle East.

UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat

arXiv · Aug 24 · LLM Arabic AI

This paper presents a UI-level evaluation of ALLaM-34B, an Arabic-centric LLM developed by SDAIA and deployed in the HUMAIN Chat service. The evaluation used a prompt pack spanning various Arabic dialects, code-switching, reasoning, and safety, with outputs scored by frontier LLM judges. Results indicate strong performance in generation, code-switching, MSA handling, reasoning, and improved dialect fidelity, positioning ALLaM-34B as a robust Arabic LLM suitable for real-world use.

QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning

arXiv · Aug 20 · NLP LLM

The QU-NLP team presented their approach to the QIAS 2025 shared task on Islamic Inheritance Reasoning, fine-tuning the Fanar-1-9B model using LoRA and integrating it into a RAG pipeline. Their system achieved an accuracy of 0.858 on the final test, outperforming models like GPT 4.5, LLaMA, and Mistral in zero-shot settings. The system particularly excelled in advanced reasoning, achieving 97.6% accuracy. Why it matters: This demonstrates the effectiveness of domain-specific fine-tuning and retrieval augmentation for Arabic LLMs in complex reasoning tasks, even surpassing frontier models.

Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation

arXiv · Aug 19 · NLP LLM

This paper introduces Saudi-Dialect-ALLaM, a LoRA fine-tuned version of the Saudi Arabian foundation model ALLaM-7B-Instruct-preview, designed to improve the generation of Saudi dialects (Najdi and Hijazi). The model is trained on a private dataset of 5,466 synthetic instruction-response pairs, with two variants explored: Dialect-Token and No-Token training. Results indicate that the Dialect-Token model achieves superior dialect control and fidelity compared to generic instruction models, although the dataset and model weights are not released.

Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks

arXiv · Aug 13 · NLP LLM

This paper benchmarks the performance of large language models (LLMs) on Arabic medical natural language processing tasks using the AraHealthQA dataset. The study evaluated LLMs in multiple-choice question answering, fill-in-the-blank, and open-ended question answering scenarios. The results showed that a majority voting solution using Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3 achieved 77% accuracy on MCQs, while other LLMs achieved a BERTScore of 86.44% on open-ended questions. Why it matters: The research highlights both the potential and limitations of current LLMs in Arabic clinical contexts, providing a baseline for future improvements in Arabic medical AI.
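
A majority-voting step over several models' multiple-choice answers can be sketched as follows. The tie-breaking rule (prefer the earliest model in the list) is an assumption made for this sketch; the summary does not say how the study resolved ties.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties go to the earliest answer in the list."""
    counts = Counter(answers)
    top = max(counts.values())
    # Scan in input order so that ties resolve to the first model's answer.
    for answer in answers:
        if counts[answer] == top:
            return answer
```

With three voters, as in the study, a tie only occurs when all three models disagree, in which case this sketch falls back to the first model's answer.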

20 million words and counting: UAE’s grand plan to power Arabic with AI - Gulf Business

WAM News · Aug 11 · NLP LLM

The UAE government is developing large language models (LLMs) specifically for the Arabic language, with a target training dataset of 20 million words. This initiative aims to overcome the underrepresentation of Arabic in existing AI models. The project seeks to enhance AI's ability to understand and generate nuanced Arabic content. Why it matters: A national Arabic LLM can enable culturally relevant AI applications across various sectors in the region, from education to government services.

UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases

arXiv · Jul 29 · LLM Research

Researchers introduce UnsafeChain, a new safety alignment dataset designed to improve the safety of large reasoning models (LRMs) by focusing on 'hard prompts' that elicit harmful outputs. The dataset identifies and corrects unsafe completions into safe responses, exposing models to unsafe behaviors and guiding their correction. Fine-tuning LRMs on UnsafeChain demonstrates enhanced safety and preservation of general reasoning ability compared to existing datasets like SafeChain and STAR-1.

Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation

arXiv · Jul 27 · NLP LLM

This paper explores Dialectal Arabic (DA) to Modern Standard Arabic (MSA) machine translation using prompting and fine-tuning techniques for Levantine, Egyptian, and Gulf dialects. The study found that few-shot prompting outperformed zero-shot and chain-of-thought methods across six large language models, with GPT-4o achieving the highest performance. A quantized Gemma2-9B model achieved a chrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58). Why it matters: The research provides a resource-efficient pipeline for DA-MSA translation, enabling more inclusive language technologies by addressing the challenges posed by dialectal variations in Arabic.
