GCC AI Research

NLP

Natural language processing research from GCC institutions, covering Arabic NLP, multilingual models, text classification, named entity recognition, and machine translation.

201–250 articles · Page 5

Evaluation of Small Language Models for Arabic Language Processing

arXiv · · NLP LLM

A new paper evaluated twelve Small Language Models (SLMs) on Arabic natural language processing tasks, utilizing a benchmark of 240 Arabic test items across eight domains and ten language skills. The models were assessed in a zero-shot setting, with responses scored using a multi-model LLM-as-a-judge framework involving GPT-4.1 Mini, Claude Haiku 4.5, and DeepSeek-Chat. Gemma 3 (12B) achieved the highest overall score (4.548/5), followed by Aya and C4AI Command Arabic, with results suggesting that strong Arabic alignment and instruction-following are crucial for performance. Why it matters: This benchmark offers a standardized method for evaluating compact Arabic language models, guiding future development towards more efficient, reliable, and culturally relevant Arabic AI systems.

Meet Zayed: UAE’s AI-powered spokesperson for the Presidential Court - Gulf News

Gulf News · · Product Arabic AI

The UAE Presidential Court has unveiled "Zayed," an AI-powered spokesperson designed to represent the court. This new digital entity is expected to deliver official announcements and engage with the public, leveraging advanced artificial intelligence technologies. The initiative represents a significant step in the UAE's efforts to integrate AI into government communications and public services. Why it matters: This deployment showcases the UAE's proactive adoption of AI in high-profile public sector roles, potentially setting a precedent for AI integration in government communication across the region.

Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Lexical Markup Framework and TEI Lex-0

arXiv · · NLP Arabic AI

This paper presents a methodology for digitizing and encoding the Al-Mawrid Arabic-English dictionary, transforming it into a standardized computational lexicon using the ISO Lexical Markup Framework (LMF) and TEI Lex-0 guidelines. The research, based on an empirical analysis of the letter Ayn (4.6% of the dictionary), achieved a structural parsing accuracy of 91%. Quantitative evaluation showed high performance for information extraction rules, including 85% precision and 98% recall for synonyms. Why it matters: This work addresses a significant gap in Arabic lexical infrastructure, providing an interoperable, machine-tractable resource and a reproducible workflow for retro-digitizing complex legacy bilingual lexicons for Arabic NLP and Digital Humanities.

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

arXiv · · Research LLM

Researchers have introduced BloomBench, a cognitively grounded, bilingual (English-Arabic) multimodal benchmark for Vision-Language Models (VLMs), as part of the Almieyar benchmarking series. Grounded in Bloom's Taxonomy, it systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. A comprehensive study using BloomBench revealed that state-of-the-art VLMs exhibit strong semantic understanding but struggle significantly with factual recall and creative synthesis, and that a critical performance gap separates Arabic from English. Why it matters: This benchmark provides a crucial tool for diagnosing cognitive weaknesses in current VLMs and lays the groundwork for developing more cognitively aligned and inclusive multimodal AI, particularly for cross-lingual applications.

An NLP-Driven Framework for Curriculum-Labor Market Alignment: Schema-Constrained LLM Extraction, ESCO-Anchored Semantic Matching, and Multi-Dimensional Gap Quantification

arXiv · · NLP LLM

Researchers proposed a four-stage NLP framework combining schema-constrained LLM extraction, Sentence-BERT (SBERT) alignment with ESCO, an adjudication protocol, and a verification mechanism for curriculum-labor market alignment. The framework was instantiated for the ABET-accredited BSc Computer Science program at the United Arab Emirates University (UAEU), extracting 400 competency records from the study plan and aligning them with 30 job postings. The extractor achieved a Cohen's kappa of 0.79 on the skill slot and surfaced interpretable supply-demand gaps in general, transversal, algorithms, and software engineering skills, with a minimal gap in AI and data science. Why it matters: This framework provides a robust, NLP-driven method to identify crucial skill gaps in higher education curricula, directly supporting quality assurance and workforce development initiatives in the region.
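The agreement figure quoted above is Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. A minimal sketch of the standard formula follows; the function name and the label lists are illustrative assumptions, not the paper's code or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical labels: a human annotator versus an automatic extractor.
annotator = ["skill", "skill", "knowledge", "knowledge"]
extractor = ["skill", "skill", "knowledge", "skill"]
kappa = cohens_kappa(annotator, extractor)
```

In this toy case the two label lists agree on three of four items, but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate.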

UAE spotlights Agentic AI as the future of government communication, launches Government Media Content Guideline - Economy Middle East

The National · · Policy Arabic AI

The UAE has spotlighted Agentic AI as a key element for the future of government communication. Concurrently, the government launched the Government Media Content Guideline to regulate content in this evolving landscape. This initiative underscores the UAE's strategic move to integrate advanced AI technologies into its public sector operations. Why it matters: This development signifies a proactive governmental approach to AI adoption and regulation, potentially setting a precedent for other nations in the Middle East in managing AI-powered public communication.

Uncovering Temporal Framing in the News

arXiv · · NLP Research

Researchers from MBZUAI have proposed a new taxonomy of eight temporal frames and studied their persuasive use in news discourse. They created a multilingual dataset by expertly annotating 458 English and German news articles, identifying over 2,000 temporally framed sentences and approximately 3,000 annotations. Their experiments demonstrated that temporal framing is learnable at the sentence level, with supervised models significantly outperforming zero-shot classification approaches. Why it matters: This research provides a valuable dataset and methodology for understanding how time-related language shapes interpretation in news, contributing to advancements in NLP for media analysis and potentially countering disinformation.

ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination

arXiv · · NLP Arabic AI

ArabDiscrim is a new corpus of 293,000 public Arabic Facebook posts from 2014 to 2024 that discuss racism and discrimination. Unlike prior Twitter-centric datasets, it incorporates platform-native engagement signals, 200 curated terms with morphological regex families, and 20 discrimination axes. The resource also provides explicit attribution patterns and is released under a restricted research-use license for ethical compliance. Why it matters: This dataset provides a unique, ecologically valid foundation for fairness-oriented and platform-aware Arabic Natural Language Processing, moving beyond existing Twitter-centric resources.

Building Arabic NLP from the Ground Up: Twenty Years of Lessons, Failures, and Open Problems

arXiv · · NLP Arabic AI

This paper reflects on two decades of building NLP resources and research infrastructure for Arabic, a historically underserved language. The first decade focused on foundational linguistic infrastructure, while the second shifted towards computational social science and socially oriented applications. The authors highlight three lessons: dataset building is a social process, communities often matter more than shared tasks, and computational social science exposes challenges beyond traditional NLP training. Why it matters: The paper argues that the most difficult problems in developing NLP for underserved communities are social, institutional, and epistemic, offering critical insights for future research directions in Arabic AI.

JobArabi: An Arabic Corpus and Analysis of Job Announcements from Social Media

arXiv · · NLP Arabic AI

Researchers have introduced JobArabi, a new large-scale corpus consisting of 20,528 Arabic job announcements collected from X between January 2024 and October 2025. The dataset was compiled using a linguistically informed query framework covering various Arabic recruitment expressions, offering metadata like timestamps and geolocation for detailed analysis. Quantitative analysis of JobArabi reveals sociolinguistic patterns, including persistent gendered hiring language, regional occupational demand variations, and emotional framing in recruitment messages. Why it matters: This corpus provides a valuable resource for research in Arabic NLP, computational social science, and digital labor studies, offering unique insights into labor market communication and linguistic change in the Arab world.

LLM-Based Financial Sentiment Analysis in Arabic: Evidence from Saudi Markets

arXiv · · NLP LLM

Researchers developed an Arabic NLP framework designed for large-scale financial sentiment analysis specifically tailored to the Saudi market. The framework integrates official financial news and social media, constructing an 84K-sample Arabic financial corpus through a multi-stage pipeline encompassing data collection, cleaning, and sentiment annotation. It employs Transformer-based NER and a curated company lexicon to link textual mentions to canonical company identifiers, assigning five-class sentiment labels for analyzing sentiment dynamics relative to stock market behavior on the Saudi Exchange. Why it matters: This research addresses a critical gap in Arabic financial NLP resources, offering a scalable method to understand investor sentiment in a key Middle Eastern market.

NYU Abu Dhabi translates speech into sign language using AI - The National

The National · · Research NLP

Researchers at NYU Abu Dhabi have developed an AI system capable of translating spoken language into sign language. This innovative technology aims to enhance communication accessibility for individuals who are deaf or hard-of-hearing. The system leverages advancements in artificial intelligence, likely combining natural language processing for speech understanding and computer vision for sign generation. Why it matters: This development has the potential to significantly improve inclusion and communication for deaf communities within the Middle East and globally, bridging critical communication gaps.

The Cylindrical Representation Hypothesis for Language Model Steering

arXiv · · NLP LLM

Researchers have proposed the Cylindrical Representation Hypothesis (CRH) to address the instability and unpredictability observed in steering large language models, an issue not fully explained by the existing Linear Representation Hypothesis (LRH). CRH suggests that overlapping concept contributions lead to a sample-specific axis-orthogonal structure, comprising a central axis for concept generation and a surrounding normal plane for steering sensitivity. This framework identifies intrinsic uncertainty at the 'sensitive sector' level within the plane, providing a principled explanation for fluctuations in steering outcomes. Experiments verify the existence of this cylindrical structure and demonstrate CRH's practical utility in interpreting real-world model steering behavior, with code available on GitHub from mbzuai-nlp. Why it matters: This research from MBZUAI offers a crucial theoretical advancement in understanding and potentially improving the control and reliability of large language models.

The Cylindrical Representation Hypothesis for Language Model Steering

arXiv · · LLM NLP

Researchers from MBZUAI have proposed the Cylindrical Representation Hypothesis (CRH) to explain the instability and unpredictability observed in large language model steering. CRH relaxes the orthogonality assumption of the existing Linear Representation Hypothesis, positing a cylindrical structure where a central axis captures concept differences and a surrounding normal plane controls steering sensitivity. The hypothesis suggests that the intrinsic uncertainty in identifying specific sensitive sectors within this normal plane accounts for why steering outcomes frequently fluctuate even with well-aligned directions. Why it matters: This research offers a more robust theoretical framework for understanding and potentially improving the control and reliability of large language models.

Instruction-Guided Poetry Generation in Arabic and Its Dialects

arXiv · · NLP LLM

Researchers at MBZUAI have developed a new method for controllable poetry generation in Arabic and its dialects, moving beyond traditional analysis tasks for Arabic poetry within Large Language Models (LLMs). They introduce a large-scale, instruction-based dataset in Modern Standard Arabic (MSA) and various Arabic dialects, enabling LLMs to perform tasks like writing, revising, and continuing poems based on user criteria. Experiments show that fine-tuning LLMs on this dataset results in models capable of generating poetry aligned with user requirements, validated by automated metrics and human evaluation. Why it matters: This work represents a significant advancement in Arabic Natural Language Processing, offering tools for creative expression and cultural preservation while opening new avenues for user-guided content generation in culturally rich text forms.

New Google AI feature lets your data power smarter answers — now in UAE - Gulf News

Gulf News · · Product LLM

Google has introduced a new AI feature in the United Arab Emirates, designed to provide more intelligent and personalized answers to users. This feature reportedly leverages user data, with consent, to enhance its responsiveness and relevance. The rollout in the UAE signifies the expansion of Google's advanced AI services into the Middle East market. Why it matters: This launch represents increased access to sophisticated AI tools for consumers and businesses in the UAE, potentially accelerating AI adoption and innovation in the local digital economy.

Severity-Aware Weighted Loss for Arabic Medical Text Generation

arXiv · · NLP LLM

Researchers proposed a severity-aware weighted loss method to fine-tune Arabic language models for medical text generation, prioritizing severe clinical cases. This approach utilizes soft severity probabilities, derived from an AraBERT-based classifier, to dynamically scale token-level loss contributions during optimization on the MAQA dataset. The method consistently improved performance across ten Arabic LLMs, with AraGPT2-Base increasing from 54.04% to 66.14% and AraGPT2-Medium from 59.16% to 67.18%. Why it matters: This novel fine-tuning strategy addresses a critical limitation in medical AI by enhancing the safety and reliability of Arabic medical large language models, particularly in high-stakes clinical scenarios.
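The idea of scaling token-level loss by soft severity probabilities can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: the per-class weights, the use of an expected weight, and the function name are all hypothetical.

```python
import math


def severity_weighted_loss(token_log_probs, severity_probs, class_weights):
    """Mean token negative log-likelihood scaled by an expected severity weight.

    token_log_probs: log-probabilities the model assigned to the reference tokens.
    severity_probs:  soft class probabilities from a severity classifier.
    class_weights:   hypothetical loss weight per severity class (an assumption).
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Expected weight under the classifier's soft severity distribution.
    weight = sum(p * w for p, w in zip(severity_probs, class_weights))
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return weight * mean_nll


# Toy example: two reference tokens, a classifier that is 80% sure the case is severe.
log_probs = [math.log(0.5), math.log(0.25)]
loss = severity_weighted_loss(log_probs, [0.2, 0.8], [1.0, 2.0])
baseline = severity_weighted_loss(log_probs, [0.2, 0.8], [1.0, 1.0])
```

With all class weights set to 1.0 the function reduces to the ordinary mean negative log-likelihood, which makes the effect of the severity weighting easy to compare against an unweighted baseline.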

State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation

arXiv · · NLP LLM

Arabic-DeepSeek-R1 is an application-driven, open-source Arabic Large Language Model (LLM) that sets a new state of the art (SOTA) on the Open Arabic LLM Leaderboard (OALL). The model utilizes a sparse Mixture-of-Experts (MoE) backbone and a four-phase Chain-of-Thought (CoT) distillation scheme, which incorporates Arabic-specific linguistic verification and regional ethical norms. It records the highest average score on the OALL suite and outperforms proprietary frontier systems like GPT-5.1 on a majority of benchmarks evaluating comprehensive Arabic language-specific tasks. Why it matters: This work offers a validated and cost-effective framework for developing high-performing, culturally-grounded AI for under-represented languages, addressing the digital equity gap.

Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation

arXiv · · NLP Research

Researchers have developed OmniScore, a family of deterministic learned metrics for evaluating generative text, intended as an alternative to using Large Language Models (LLMs) as judges. OmniScore leverages small models (<1B parameters) and was trained on approximately 564,000 synthetic instances across 107 languages, then evaluated using 8,617 manually annotated instances. It approximates LLM-judge behavior while offering low latency and consistency across evaluation settings such as reference-based and source-grounded assessment, in tasks including QA, translation, and summarization. Why it matters: This development provides a practical, scalable, and reproducible method for multilingual generative text evaluation, addressing key limitations of LLM-as-a-judge approaches and offering significant benefits for AI development in linguistically diverse regions.

Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation

arXiv · · LLM NLP

QIMMA is introduced as a quality-assured Arabic LLM leaderboard that places systematic benchmark validation at its core. It employs a multi-model assessment pipeline combining automated LLM judgment with human review to identify and resolve quality issues in established Arabic benchmarks. The resulting evaluation suite comprises over 52,000 samples, predominantly grounded in native Arabic content, with transparent implementation via LightEval and EvalPlus. Why it matters: This initiative provides a more reliable and reproducible foundation for evaluating Arabic Large Language Models, addressing critical quality concerns in existing benchmarks.

Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

arXiv · · NLP LLM

Researchers developed a retrieval-augmented generation (RAG) framework to improve Arabic Large Language Models (LLMs) in understanding complex historical and religious texts like the Quran and Hadith. This framework grounds LLMs in the Doha Historical Dictionary of Arabic (DHDA) through hybrid retrieval and intent-based routing. The approach significantly boosted the accuracy of Arabic-native LLMs such as Fanar and ALLaM to over 85%, closing the performance gap with proprietary models like Gemini. Why it matters: This research offers a novel method for enhancing Arabic NLP capabilities for historically nuanced texts, demonstrating the value of integrating diachronic lexicographic resources into RAG systems for deeper language understanding.

TII Launches Falcon Perception, A New Multimodal AI Model That Helps Machines See and Understand the World – with Efficiency that Rivals Larger Models

TII · · CV NLP

The Technology Innovation Institute (TII) has launched Falcon Perception, a new 600-million-parameter multimodal AI model. This model offers competitive performance in object segmentation, dense visual understanding, and document intelligence, rivalling larger systems like Meta’s SAM3 and Alibaba’s Qwen with significantly greater efficiency. Falcon Perception unifies image and language processing in a single architecture, designed for real-world deployment in compute-constrained environments. Why it matters: This development positions the UAE among leading nations in advanced multimodal AI, which is crucial for applications in robotics, advanced manufacturing, and autonomous platforms.

Introducing the Open Arabic LLM Leaderboard: Empowering the Arabic Language Modeling Community

TII · · NLP LLM

The Open Arabic LLM Leaderboard (OALL) has been launched to benchmark Arabic language models, addressing the gap in resources for non-English NLP. It incorporates datasets like AlGhafa, ACVA, and translated versions of MMLU and EXAMS from the AceGPT suite. The leaderboard is built around Hugging Face's LightEval framework and scores tasks with normalized log-likelihood accuracy. Why it matters: This initiative promotes research and development in Arabic NLP, serving over 380 million Arabic speakers by enhancing the evaluation and improvement of Arabic LLMs.
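Normalized log-likelihood accuracy for multiple-choice tasks can be sketched as follows: each answer option is scored by the model's log-likelihood divided by the option's length, and the highest-scoring option is compared with the gold answer. This is a minimal sketch, not LightEval's implementation; the data layout, the function names, and the choice of length unit are assumptions.

```python
def pick_choice(choice_log_probs, choice_lengths):
    """Index of the option with the highest length-normalized log-likelihood."""
    scores = [lp / n for lp, n in zip(choice_log_probs, choice_lengths)]
    return max(range(len(scores)), key=scores.__getitem__)


def normalized_accuracy(items):
    """Fraction of items whose best-scoring option matches the gold index."""
    correct = sum(
        pick_choice(item["log_probs"], item["lengths"]) == item["gold"]
        for item in items
    )
    return correct / len(items)


# Hypothetical scores: summed log-likelihood and length for each answer option.
items = [
    {"log_probs": [-10.0, -12.0], "lengths": [5, 10], "gold": 1},
    {"log_probs": [-3.0, -9.0], "lengths": [3, 3], "gold": 0},
]
accuracy = normalized_accuracy(items)
```

In the first toy item the raw log-likelihood favors option 0, while the length-normalized score favors option 1, which shows why normalization matters when answer options differ in length.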

Technology Innovation Institute Announces Launch of NOOR, the World’s Largest Arabic NLP Model

TII · · NLP LLM

The Technology Innovation Institute (TII) in Abu Dhabi, in collaboration with LightOn, has launched NOOR, a 10-billion-parameter Arabic natural language processing (NLP) model. The model was trained on a large, high-quality cross-domain Arabic dataset including web data, books, poetry, news, and technical information. It enables applications in automated summarization, chatbots, and personalized marketing. Why it matters: NOOR represents a significant advancement in Arabic NLP, potentially enabling more sophisticated AI applications tailored to the Arabic language and regional needs.

Abu Dhabi’s TII Launches Falcon-H1 Arabic, Establishing the World’s Leading Arabic AI Model

TII · · NLP LLM

Abu Dhabi’s Technology Innovation Institute (TII) has launched Falcon-H1 Arabic, a new large language model based on a hybrid Mamba-Transformer architecture. The Falcon-H1 family comes in 3B, 7B, and 34B parameter sizes and outperforms existing models on the Open Arabic LLM Leaderboard (OALL). The model features improvements in data quality, dialect coverage, and long-context stability. Why it matters: This release strengthens the UAE's position in Arabic AI and provides a high-performing model tailored to the linguistic and cultural needs of the region.

SectEval: Evaluating the Latent Sectarian Preferences of Large Language Models

arXiv · · NLP LLM

The paper introduces SectEval, a new benchmark to evaluate sectarian biases in LLMs concerning Sunni and Shia Islam, available in English and Hindi. Results show significant inconsistencies in LLM responses based on language, with some models favoring Shia responses in English but Sunni in Hindi. Location-based experiments further reveal that advanced models adapt their responses based on the user's claimed country, while smaller models exhibit a consistent Sunni-leaning bias.

Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation

arXiv · · NLP LLM

MBZUAI researchers have developed an automatic interview system that uses LLMs to elicit nuanced, role-specific information from job candidates, improving early-stage hiring decisions. The system updates its belief about an applicant's rubric-oriented latent traits in a calibrated way based on their interview performance. Evaluation on simulated interviews showed the system's belief converges towards the simulated applicants' constructed ability levels.

ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models

arXiv · · NLP LLM

The paper introduces ArabicNumBench, a benchmark for evaluating LLMs on Arabic number reading using both Eastern and Western Arabic numerals. It evaluates 71 models from 10 providers on 210 number reading tasks, using zero-shot, zero-shot CoT, few-shot, and few-shot CoT prompting strategies. The results show substantial performance variation, with few-shot CoT prompting achieving 2.8x higher accuracy than zero-shot approaches. Why it matters: The benchmark establishes baselines for Arabic number comprehension and provides guidance for model selection in production Arabic NLP systems.
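The two numeral systems the benchmark covers differ only in their digit glyphs: Eastern Arabic digits occupy Unicode code points U+0660 to U+0669, while Western Arabic digits are ASCII 0 to 9. A minimal sketch of converting between them with the standard library follows; the helper names are illustrative and not taken from the benchmark.

```python
# Eastern Arabic digits are U+0660..U+0669; Western Arabic digits are ASCII 0-9.
EASTERN_DIGITS = "".join(chr(0x0660 + d) for d in range(10))
WESTERN_DIGITS = "0123456789"

TO_WESTERN = str.maketrans(EASTERN_DIGITS, WESTERN_DIGITS)
TO_EASTERN = str.maketrans(WESTERN_DIGITS, EASTERN_DIGITS)


def to_western(text):
    """Replace Eastern Arabic digits with Western ones; other characters are kept."""
    return text.translate(TO_WESTERN)


def to_eastern(text):
    """Replace Western Arabic digits with Eastern ones; other characters are kept."""
    return text.translate(TO_EASTERN)


eastern_year = "\u0662\u0660\u0662\u0664"  # the digits 2, 0, 2, 4 in Eastern Arabic form
western_year = to_western(eastern_year)
# Python's int() also accepts Unicode decimal digits directly.
parsed_year = int(eastern_year)
```

Normalizing digits this way only handles the glyphs; reading a number aloud in Arabic, which is what the benchmark tests, is a separate and harder problem.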

ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning

arXiv · · NLP Arabic AI

The paper introduces ALPS (Arabic Linguistic & Pragmatic Suite), a diagnostic challenge set for evaluating deep semantics and pragmatics in Arabic NLP. The dataset contains 531 expert-curated questions across 15 tasks and 47 subtasks, designed to test morpho-syntactic dependencies and compositional semantics. Evaluation of 23 models, including commercial, open-source, and Arabic-native models, reveals that models struggle with fundamental morpho-syntactic dependencies, especially those reliant on diacritics. Why it matters: ALPS provides a valuable benchmark for evaluating the linguistic competence of Arabic NLP models, highlighting areas where current models fall short despite achieving high fluency.

From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models

arXiv · · NLP LLM

Arabic Language Models (LMs) are primarily pretrained on Modern Standard Arabic (MSA), with an expectation of transferring to diverse Arabic dialects for real-world applications. This work explores cross-lingual transfer in Arabic LMs using probing on three Natural Language Processing (NLP) tasks and representational similarity. The findings indicate that transfer is possible but disproportionate across dialects, with some evidence of negative interference in models trained to support all Arabic dialects. Why it matters: This research highlights crucial challenges for building robust Arabic AI systems that effectively handle the significant linguistic diversity of the Arab world.

SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models

arXiv · · NLP LLM

The paper introduces SalamahBench, a new benchmark for evaluating the safety of Arabic Language Models (ALMs). The benchmark comprises 8,170 prompts across 12 categories aligned with the MLCommons Safety Hazard Taxonomy. Five state-of-the-art ALMs were evaluated using the benchmark: Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2. Why it matters: The benchmark enables standardized, category-aware safety evaluation, highlighting the necessity of specialized safeguard mechanisms for robust harm mitigation in ALMs.

DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding

arXiv · · CV NLP

MBZUAI researchers introduce DuwatBench, a new benchmark for multimodal understanding of Arabic calligraphy. The dataset contains 1,272 samples across six calligraphic styles with detailed annotations to evaluate visual-text alignment. Evaluation of 13 multimodal models reveals challenges in processing calligraphic variations and artistic distortions, highlighting the need for culturally grounded AI research.

Generative AI in Saudi Arabia: A National Survey of Adoption, Risks, and Public Perceptions

arXiv · · Research Policy

A national survey of 330 participants in Saudi Arabia reveals that 93% are actively using Generative AI, primarily for text-based tasks, while awareness and understanding remain uneven. Participants recognize benefits like productivity but caution against risks around privacy, misinformation, and ethical misuse. The study highlights the need for AI literacy, culturally aligned solutions, and stronger frameworks for responsible deployment in Saudi Arabia.

YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

arXiv · · LLM RL

The paper introduces Yet another Policy Optimization (YaPO), a reference-free method for learning sparse steering vectors in the latent space of a Sparse Autoencoder (SAE) to steer LLMs. By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Experiments show YaPO converges faster, achieves stronger performance, exhibits improved training stability and preserves general knowledge compared to dense steering baselines.

Community-Based Early-Stage Chronic Kidney Disease Screening using Explainable Machine Learning for Low-Resource Settings

arXiv · · Research Healthcare

This paper introduces an explainable machine learning framework for early-stage chronic kidney disease (CKD) screening, specifically designed for low-resource settings in Bangladesh and South Asia. The framework utilizes a community-based dataset from Bangladesh and evaluates multiple ML classifiers with feature selection techniques. Results show that the ML models achieve high accuracy and sensitivity, outperforming existing screening tools and demonstrating strong generalizability across independent datasets from India, the UAE, and Bangladesh.

Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation

arXiv · · NLP Arabic AI

The paper introduces Ara-HOPE, a human-centric post-editing evaluation framework for Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation. Ara-HOPE includes a five-category error taxonomy and a decision-tree annotation protocol designed to address the challenges of dialect-specific MT errors. Evaluation of Jais, GPT-3.5, and NLLB-200 shows dialect-specific terminology and semantic preservation remain key challenges. Why it matters: The new framework and public dataset will help improve the evaluation and development of dialect-aware MT systems for Arabic.

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

arXiv · · CV Research

The paper introduces the Prism Hypothesis, which posits a correspondence between an encoder's feature spectrum and its functional role, with semantic encoders capturing low-frequency components and pixel encoders retaining high-frequency information. Based on this, the authors propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details using a frequency-band modulator. Experiments on ImageNet and MS-COCO demonstrate that UAE effectively unifies semantic abstraction and pixel-level fidelity, achieving state-of-the-art performance.

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv · · NLP Arabic AI

The paper introduces AraToken, an Arabic-optimized tokenizer based on the SentencePiece Unigram algorithm that incorporates a normalization pipeline to handle Arabic-specific orthographic variations. Experiments show that AraToken achieves 18% lower fertility compared to unnormalized baselines. The Language Extension Pipeline (LEP) is introduced to integrate AraToken into Qwen3-0.6B, reducing evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: This research provides an efficient tokenizer tailored for Arabic, improving performance of LLMs on Arabic text and benefiting Arabic NLP research by providing released resources.
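Tokenizer fertility is commonly defined as the average number of tokens produced per word, so lower values mean more compact encodings. A minimal sketch of that measurement follows, assuming whitespace-delimited words; the toy tokenizers are stand-ins for illustration and are not AraToken or the Qwen3 tokenizer.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_tokens / total_words


def chunk_tokenizer(size):
    """Toy tokenizer (an assumption): split each word into fixed-size chunks."""
    def tokenize(text):
        return [
            word[i:i + size]
            for word in text.split()
            for i in range(0, len(word), size)
        ]
    return tokenize


sample = ["abcd ef"]
coarse = fertility(sample, chunk_tokenizer(4))  # 2 tokens over 2 words
fine = fertility(sample, chunk_tokenizer(2))    # 3 tokens over 2 words
relative_reduction = (fine - coarse) / fine
```

A relative reduction computed this way is the kind of figure a statement like "18% lower fertility" refers to, though the paper's exact word segmentation and evaluation corpus may differ from this sketch.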

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv · · CV NLP

A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length and multimodal richness, using human-validated samples. LongShOTAgent, an agentic system, is also presented for analyzing long videos, with both the benchmark and agent demonstrating the challenges faced by state-of-the-art MLLMs.

From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region

arXiv · · CV Research

A new study compares vision-language models (VLMs) to YOLOv8 for wastewater treatment plant (WWTP) identification in satellite imagery across the MENA region. VLMs like Gemma-3 demonstrate superior zero-shot performance compared to a YOLOv8 model trained on a dataset of 83,566 satellite images from Egypt, Saudi Arabia, and the UAE. The research suggests VLMs offer a scalable, annotation-free alternative for remote sensing of WWTPs.

FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models

arXiv · · NLP LLM

The paper introduces FanarGuard, a bilingual moderation filter for Arabic and English language models that considers both safety and cultural alignment. A dataset of 468K prompt-response pairs was created and scored by LLM judges on harmlessness and cultural awareness to train the filter. The first benchmark targeting Arabic cultural contexts was developed to evaluate cultural alignment. Why it matters: FanarGuard advances context-sensitive AI safeguards by integrating cultural awareness into content moderation, addressing a critical gap in current alignment techniques.

Cross-Document Topic-Aligned Chunking for Retrieval-Augmented Generation

arXiv · · NLP LLM

This paper introduces Cross-Document Topic-Aligned (CDTA) chunking to address knowledge fragmentation in Retrieval-Augmented Generation (RAG) systems. CDTA identifies topics across documents, maps segments to topics, and synthesizes them into unified chunks. Experiments on HotpotQA and UAE legal texts show that CDTA improves faithfulness and citation accuracy compared to existing chunking methods, especially for complex queries requiring multi-hop reasoning.
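The three steps described above (identify topics, map segments to topics, merge them into unified chunks) can be sketched roughly as follows. This is an illustrative toy, not the paper's method: topic identification is replaced by a hand-supplied keyword table, and synthesis is replaced by plain concatenation. The function name `topic_aligned_chunks` and its input shapes are hypothetical.

```python
import re
from collections import defaultdict


def topic_aligned_chunks(documents, topic_keywords):
    """Group segments from several documents into one chunk per topic.

    documents: dict mapping doc_id -> list of segment strings.
    topic_keywords: dict mapping topic -> set of lowercase keywords.
        (Assumption: stands in for a real topic-identification step.)
    """
    by_topic = defaultdict(list)
    for doc_id, segments in documents.items():
        for segment in segments:
            words = set(re.findall(r"\w+", segment.lower()))
            for topic, keywords in topic_keywords.items():
                if words & keywords:  # segment mentions this topic
                    by_topic[topic].append((doc_id, segment))

    chunks = {}
    for topic, items in by_topic.items():
        chunks[topic] = {
            # Assumption: concatenation stands in for the synthesis step.
            "text": "\n".join(segment for _, segment in items),
            # Keep source ids so answers can cite the originating documents.
            "sources": sorted({doc_id for doc_id, _ in items}),
        }
    return chunks
```

Keeping the source document ids alongside each merged chunk is one simple way to preserve traceability for citations once segments from different documents have been combined.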

Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

arXiv · · NLP LLM

A new method is proposed to reduce the verbosity of LLMs in step-by-step reasoning by retaining moderately easy problems during Reinforcement Learning with Verifiable Rewards (RLVR) training. This approach acts as an implicit length regularizer, preventing the model from excessively increasing output length on harder problems. Experiments using Qwen3-4B-Thinking-2507 show the model achieves baseline accuracy with solutions nearly half as long.
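As a rough illustration of what "retaining moderately easy problems" could look like at the data-selection stage, the sketch below filters a problem pool by empirical pass rate. It assumes difficulty is measured as the fraction of sampled solutions that verify correctly; the function name `select_problems` and the threshold values are hypothetical and are not taken from the paper.

```python
def select_problems(pass_rates, lower=0.0, upper=0.9):
    """Return ids of problems whose pass rate lies in [lower, upper].

    pass_rates: dict mapping problem id -> fraction of sampled solutions
        that passed verification (assumed difficulty proxy).
    Raising `upper` keeps easier problems in the training pool;
    lowering it discards them.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    return [pid for pid, rate in pass_rates.items() if lower <= rate <= upper]


# Hypothetical pass rates for four problems.
rates = {"p1": 0.1, "p2": 0.6, "p3": 0.85, "p4": 1.0}
hard_only = select_problems(rates, upper=0.5)   # drops the easier problems
with_easy = select_problems(rates, upper=0.9)   # retains moderately easy ones
```

In this sketch only the always-solved problem `p4` is excluded from `with_easy`, whereas `hard_only` also drops `p2` and `p3`.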

LLM-based Multi-class Attack Analysis and Mitigation Framework in IoT/IIoT Networks

arXiv · · Research NLP

This paper introduces a framework that combines machine learning for multi-class attack detection in IoT/IIoT networks with large language models (LLMs) for attack behavior analysis and mitigation suggestion. The framework uses role-play prompt engineering with RAG to guide LLMs like ChatGPT-o3 and DeepSeek-R1, and introduces new evaluation metrics for quantitative assessment. Experiments using Edge-IIoTset and CICIoT2023 datasets showed Random Forest as the best detection model and ChatGPT-o3 outperforming DeepSeek-R1 in attack analysis and mitigation.

Mubeen AI: A Specialized Arabic Language Model for Heritage Preservation and User Intent Understanding

arXiv · · NLP LLM

MASARAT SA has developed Mubeen, a proprietary Arabic language model specializing in Arabic linguistics, Islamic studies, and cultural heritage. Mubeen was trained using native Arabic sources, including digitized historical manuscripts processed via a proprietary Arabic OCR engine. The model employs a Practical Closure Architecture to improve user intent understanding and provide decisive guidance. Why it matters: Mubeen addresses the utility gap in current Arabic LLMs by focusing on native Arabic data and cultural authenticity, which is critical for heritage preservation and alignment with Saudi Vision 2030.

Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

arXiv · · NLP LLM

This survey paper analyzes over 40 benchmarks used to evaluate Arabic large language models, categorizing them into Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. It identifies progress in benchmark diversity but also highlights gaps like limited temporal evaluation and cultural misalignment. The paper also examines methods for creating benchmarks, including native collection, translation, and synthetic generation. Why it matters: The survey provides a comprehensive reference for Arabic NLP research and offers recommendations for future benchmark development to better align with cultural contexts.

Developing and Validating the Arabic Version of the Attitudes Toward Large Language Models Scale

arXiv · · NLP LLM

This paper presents the development and validation of an Arabic version of the Attitudes Toward Large Language Models (AT-GLLM and AT-PLLM) scales, adapted from the original English versions. The study involved translating the scales and testing them on a sample of 249 Arabic-speaking adults. The translated scales demonstrated strong psychometric properties, including a two-factor structure, measurement invariance across genders, and good reliability and validity. Why it matters: This provides a culturally relevant tool for assessing attitudes toward LLMs in the Arab world, crucial for localized research and policy-making in the rapidly growing field of Arabic AI.

ALARB: An Arabic Legal Argument Reasoning Benchmark

arXiv · · NLP LLM

Researchers introduce ALARB, a new benchmark for evaluating reasoning in Arabic LLMs using 13K Saudi commercial court cases. The benchmark includes tasks such as verdict prediction, reasoning chain completion, and identification of relevant regulations. A 12B-parameter model instruction-tuned on ALARB achieves performance comparable to GPT-4o in verdict prediction and generation.

Gender Stereotypes in Professional Roles Among Saudis: An Analytical Study of AI-Generated Images Using Language Models

arXiv · · Research Ethics

The study analyzes over 1,000 images generated by ImageFX, DALL-E V3, and Grok for 56 Saudi professions, finding significant gender imbalances and cultural inaccuracies. DALL-E V3 exhibited the strongest gender stereotyping, with 96% male depictions, particularly in leadership and technical roles. The research underscores the need for diverse training data and culturally sensitive evaluation to ensure equitable AI outputs that accurately reflect Saudi Arabia's labor market and culture.

Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale

arXiv · · NLP LLM

The Hala technical report introduces a family of Arabic-centric instruction and translation models developed using a translate-and-tune pipeline. A strong Arabic-English teacher model is compressed to FP8 and used to create bilingual supervision data. The LFM2-1.2B model is fine-tuned on this data and used to translate English instruction sets into Arabic, creating a million-scale corpus. Why it matters: The release of models, data, evaluation tools, and recipes will accelerate research and development in Arabic NLP, providing valuable resources for the community.