arXiv

arXiv preprint server — articles filtered for GCC-affiliated authors in AI, Machine Learning, NLP, Computer Vision, Information Retrieval, Robotics, and Statistics. Covers research from MBZUAI, KAUST, TII, Khalifa University, QCRI, and collaborating institutions.

https://arxiv.org

1–50 articles

Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Lexical Markup Framework and TEI Lex-0

arXiv · Jun 16 · NLP Arabic AI

This paper presents a methodology for digitizing and encoding the Al-Mawrid Arabic-English dictionary, transforming it into a standardized computational lexicon using the ISO Lexical Markup Framework (LMF) and TEI Lex-0 guidelines. The research, based on an empirical analysis of the letter Ayn (4.6% of the dictionary), achieved a structural parsing accuracy of 91%. Quantitative evaluation showed high performance for information extraction rules, including 85% precision and 98% recall for synonyms. Why it matters: This work addresses a significant gap in Arabic lexical infrastructure, providing an interoperable, machine-tractable resource and a reproducible workflow for retro-digitizing complex legacy bilingual lexicons for Arabic NLP and Digital Humanities.

Interpretable Crisis Behavior Analysis Using Mobility and Social Media Data

arXiv · Jun 8 · Research Policy

This paper introduces an interpretable pipeline that integrates mobility and social media data to analyze human behavior during crises. The framework was evaluated through two case studies, including a longitudinal analysis of UAE COVID-19 behavior from March 2020 to December 2021. The pipeline aligns heterogeneous daily signals, transforms them into binary behavioral states, applies Formal Concept Analysis (FCA) to extract co-occurrence structures, and mines association rules. Results demonstrate clear cross-domain behavioral structures in crises, yielding both scientifically credible and policy-actionable intelligence. Why it matters: This work provides a novel methodological approach for developing actionable crisis management strategies by fusing multimodal data, directly applicable to public health and emergency response in the UAE and the broader region.
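Two of the steps described above, turning daily signals into binary states and scoring an association rule, can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the signal names, the values, and the median threshold are all assumptions, and the Formal Concept Analysis step is left out.

```python
from statistics import median

# Hypothetical daily signals (illustrative values, not taken from the paper).
signals = {
    "mobility_drop": [0.1, 0.6, 0.7, 0.2, 0.8, 0.9],
    "health_posts": [5, 40, 55, 10, 60, 20],
}


def binarize(values):
    """Map each day to True when its value exceeds the series median (assumed rule)."""
    cut = median(values)
    return [v > cut for v in values]


def rule_stats(antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    days = len(antecedent)
    both = sum(a and c for a, c in zip(antecedent, consequent))
    ante = sum(antecedent)
    support = both / days
    confidence = both / ante if ante else 0.0
    return support, confidence


states = {name: binarize(vals) for name, vals in signals.items()}
support, confidence = rule_stats(states["mobility_drop"], states["health_posts"])
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Support is the share of days on which both states hold, and confidence is the share of antecedent days on which the consequent also holds.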

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

arXiv · Jun 4 · Research LLM

Researchers have introduced BloomBench, a new cognitively grounded, bilingual (English-Arabic) multimodal benchmark for Vision-Language Models (VLMs), as part of the Almieyar benchmarking series. Grounded in Bloom's Taxonomy, it systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. A comprehensive study using BloomBench revealed that state-of-the-art VLMs exhibit strong semantic understanding but struggle significantly with factual recall and creative synthesis, alongside a critical performance gap between Arabic and English. Why it matters: This benchmark provides a crucial tool for diagnosing cognitive weaknesses in current VLMs and lays the groundwork for developing more cognitively aligned and inclusive multimodal AI, particularly for cross-lingual applications.

An NLP-Driven Framework for Curriculum-Labor Market Alignment: Schema-Constrained LLM Extraction, ESCO-Anchored Semantic Matching, and Multi-Dimensional Gap Quantification

arXiv · Jun 1 · NLP LLM

Researchers proposed a four-stage NLP framework combining schema-constrained LLM extraction, Sentence-BERT (SBERT) alignment with ESCO, an adjudication protocol, and a verification mechanism for curriculum-labor market alignment. The framework was instantiated for the ABET-accredited BSc Computer Science program at the United Arab Emirates University (UAEU), extracting 400 competency records from the study plan and aligning them with 30 job postings. The extractor achieved a Cohen's kappa of 0.79 on the skill slot and surfaced interpretable supply-demand gaps in general, transversal, algorithms, and software engineering skills, with a minimal gap in AI and data science. Why it matters: This framework provides a robust, NLP-driven method to identify crucial skill gaps in higher education curricula, directly supporting quality assurance and workforce development initiatives in the region.
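For readers unfamiliar with the agreement statistic quoted above, Cohen's kappa compares observed agreement between two labelers with the agreement expected by chance. The sketch below uses made-up labels (the `annotator` and `extractor` lists are hypothetical) and is not the paper's evaluation code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(labels_a)
    # Fraction of items on which the two labelers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each labeler's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight extracted records.
annotator = ["skill", "skill", "other", "skill", "other", "other", "skill", "other"]
extractor = ["skill", "skill", "other", "other", "other", "other", "skill", "skill"]
kappa = cohens_kappa(annotator, extractor)
print(round(kappa, 2))
```

In this toy case the labelers agree on 6 of 8 items (0.75) while chance agreement is 0.5, so kappa is 0.5.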

Uncovering Temporal Framing in the News

arXiv · May 29 · NLP Research

Researchers from MBZUAI have proposed a new taxonomy of eight temporal frames and studied their persuasive use in news discourse. They created a multilingual dataset by expertly annotating 458 English and German news articles, identifying over 2,000 temporally framed sentences and approximately 3,000 annotations. Their experiments demonstrated that temporal framing is learnable at the sentence level, with supervised models significantly outperforming zero-shot classification approaches. Why it matters: This research provides a valuable dataset and methodology for understanding how time-related language shapes interpretation in news, contributing to advancements in NLP for media analysis and potentially countering disinformation.

YOLO26-RipeLoc Lite: A lightweight architecture for tomato ripeness detection and picking point localization in greenhouse robotic harvesting

arXiv · May 26 · Research Robotics

YOLO26-RipeLoc Lite is a new lightweight deep learning architecture designed for simultaneous detection, ripeness classification, and center-point localization of greenhouse tomatoes for robotic harvesting. The model incorporates a Lightweight Feature Pyramid Network, a Ripeness-Aware Attention Module, and a Compact Detection Head for efficient and precise operation. Evaluated on a custom dataset from the SILAL greenhouse in Abu Dhabi, UAE, it achieved an mAP@0.5 of 92.9% with only 2.38 million parameters, outperforming existing YOLO models on the accuracy-efficiency trade-off. Why it matters: This research provides an efficient and accurate solution for automating a critical agricultural process, enhancing food security and technological capabilities in the region's greenhouse farming.

ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination

arXiv · May 21 · NLP Arabic AI

ArabDiscrim is a new corpus comprising 293,000 public Arabic Facebook posts from 2014 to 2024, curated around discussions of racism and discrimination. Unlike prior Twitter-centric datasets, it incorporates platform-native engagement signals, 200 curated terms with morphological regex families, and 20 discrimination axes. The resource also provides explicit attribution patterns and is released under a restricted research-use license for ethical compliance. Why it matters: This dataset provides a unique, ecologically valid foundation for fairness-oriented and platform-aware Arabic Natural Language Processing, moving beyond existing Twitter-centric resources.

JobArabi: An Arabic Corpus and Analysis of Job Announcements from Social Media

arXiv · May 20 · NLP Arabic AI

Researchers have introduced JobArabi, a new large-scale corpus consisting of 20,528 Arabic job announcements collected from X between January 2024 and October 2025. The dataset was compiled using a linguistically informed query framework covering various Arabic recruitment expressions, offering metadata like timestamps and geolocation for detailed analysis. Quantitative analysis of JobArabi reveals sociolinguistic patterns, including persistent gendered hiring language, regional occupational demand variations, and emotional framing in recruitment messages. Why it matters: This corpus provides a valuable resource for research in Arabic NLP, computational social science, and digital labor studies, offering unique insights into labor market communication and linguistic change in the Arab world.

Building Arabic NLP from the Ground Up: Twenty Years of Lessons, Failures, and Open Problems

arXiv · May 20 · NLP Arabic AI

This paper reflects on two decades of building NLP resources and research infrastructure for Arabic, a historically underserved language. The first decade focused on foundational linguistic infrastructure, while the second shifted towards computational social science and socially oriented applications. The authors highlight three lessons: dataset building is a social process, communities often matter more than shared tasks, and computational social science exposes challenges beyond traditional NLP training. Why it matters: The paper argues that the most difficult problems in developing NLP for underserved communities are social, institutional, and epistemic, offering critical insights for future research directions in Arabic AI.

LLM-Based Financial Sentiment Analysis in Arabic: Evidence from Saudi Markets

arXiv · May 19 · NLP LLM

Researchers developed an Arabic NLP framework designed for large-scale financial sentiment analysis specifically tailored to the Saudi market. The framework integrates official financial news and social media, constructing an 84K-sample Arabic financial corpus through a multi-stage pipeline encompassing data collection, cleaning, and sentiment annotation. It employs Transformer-based NER and a curated company lexicon to link textual mentions to canonical company identifiers, assigning five-class sentiment labels for analyzing sentiment dynamics relative to stock market behavior on the Saudi Exchange. Why it matters: This research addresses a critical gap in Arabic financial NLP resources, offering a scalable method to understand investor sentiment in a key Middle Eastern market.

The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias

arXiv · May 6 · LLM Research

This study introduces a Probabilistic Graphical Model (PGM) framework utilizing Pearl's do-operator to causally audit LLM safety mechanisms, specifically isolating the effect of injecting cultural demographics into prompts. A large-scale empirical analysis was conducted across seven instruction-tuned models from diverse origins, including the UAE's Falcon3-7B, as well as models from the US, Europe, China, and India, using ToxiGen and BOLD datasets. The findings revealed a disparity between observational and interventional bias, demonstrating that standard fairness metrics can overestimate demographic bias. Western models exhibited higher causal refusal rates for specific demographic groups, while Eastern models showed low overall intervention rates with targeted sensitivities toward regional demographics. Why it matters: This research highlights the geopolitical nuances of LLM safety alignment and the potential for demographic-sensitive over-triggering to restrict benign discourse, which is particularly relevant for diverse regions like the Middle East in developing culturally-aware AI.

Climate-based Pre-screening of Self-sustaining Regreening Opportunities in Drylands: A Case Study for Saudi Arabia

arXiv · May 5 · Research ML

Researchers have developed a scalable pre-screening framework that integrates climate and remote sensing data to identify cost-efficient sites for sustainable dryland restoration, using Saudi Arabia as a case study. The framework employs machine learning models to derive a Climate Suitability Score (CSS), which captures climatic dependencies on vegetation persistence. National-scale prediction maps were generated using multi-year ERA5-Land data for Saudi Arabia, leading to the identification of thirteen priority locations with an estimated potential for a 2.5-fold increase in vegetation coverage. Why it matters: This approach significantly reduces the search space and costs associated with restoration efforts, supporting more resilient and sustainable ecosystem recovery planning in water-limited regions of the Middle East.

The Cylindrical Representation Hypothesis for Language Model Steering

arXiv · May 3 · LLM NLP

Researchers from MBZUAI have proposed the Cylindrical Representation Hypothesis (CRH) to explain the instability and unpredictability observed in large language model steering. CRH relaxes the orthogonality assumption of the existing Linear Representation Hypothesis, positing a cylindrical structure where a central axis captures concept differences and a surrounding normal plane controls steering sensitivity. The hypothesis suggests that the intrinsic uncertainty in identifying specific sensitive sectors within this normal plane accounts for why steering outcomes frequently fluctuate even with well-aligned directions. Why it matters: This research offers a more robust theoretical framework for understanding and potentially improving the control and reliability of large language models.

The Cylindrical Representation Hypothesis for Language Model Steering

arXiv · May 3 · NLP LLM

Researchers have proposed the Cylindrical Representation Hypothesis (CRH) to address the instability and unpredictability observed in steering large language models, an issue not fully explained by the existing Linear Representation Hypothesis (LRH). CRH suggests that overlapping concept contributions lead to a sample-specific axis-orthogonal structure, comprising a central axis for concept generation and a surrounding normal plane for steering sensitivity. This framework identifies intrinsic uncertainty at the 'sensitive sector' level within the plane, providing a principled explanation for fluctuations in steering outcomes. Experiments verify the existence of this cylindrical structure and demonstrate CRH's practical utility in interpreting real-world model steering behavior, with code available on GitHub from mbzuai-nlp. Why it matters: This research from MBZUAI offers a crucial theoretical advancement in understanding and potentially improving the control and reliability of large language models.

Governing What the EU AI Act Excludes: Accountability for Autonomous AI Agents in Smart City Critical Infrastructure

arXiv · May 1 · Policy Ethics

This research paper identifies an accountability deficit for autonomous AI agents operating in smart city critical infrastructure under the EU AI Act, noting that specific provisions exclude safety-component AI from certain explanation rights and impact assessments. It proposes AgentGov-SC, a three-layer governance architecture specifying 25 measures, 5 conflict resolution rules, and an autonomy-calibrated activation model, with bidirectional traceability to established AI frameworks. A scenario analysis traces the governance activation through a multi-agent corridor cascade involving documented UAE smart-city systems. Why it matters: This paper addresses a significant regulatory gap in AI governance for complex, multi-agent systems in critical urban infrastructure, offering a novel architectural solution highly relevant to global smart city initiatives, including those in the Middle East.

Instruction-Guided Poetry Generation in Arabic and Its Dialects

arXiv · Apr 30 · NLP LLM

Researchers at MBZUAI have developed a new method for controllable poetry generation in Arabic and its dialects, moving beyond traditional analysis tasks for Arabic poetry within Large Language Models (LLMs). They introduce a large-scale, instruction-based dataset in Modern Standard Arabic (MSA) and various Arabic dialects, enabling LLMs to perform tasks like writing, revising, and continuing poems based on user criteria. Experiments show that fine-tuning LLMs on this dataset results in models capable of generating poetry aligned with user requirements, validated by automated metrics and human evaluation. Why it matters: This work represents a significant advancement in Arabic Natural Language Processing, offering tools for creative expression and cultural preservation while opening new avenues for user-guided content generation in culturally rich text forms.

Culturally Aware GenAI Risks for Youth: Perspectives from Youth, Parents, and Teachers in a Non-Western Context

arXiv · Apr 29 · Research Ethics

A study investigated the culturally aware risks of Generative AI for youth aged 7-17 in Saudi Arabia, focusing on privacy and safety challenges. Researchers analyzed 736 Reddit posts, 1,262 X (Twitter) posts, and conducted interviews with 31 Saudi participants including youth, parents, and teachers. Findings highlighted context-dependent risks, particularly regarding the disclosure of personal and family information that conflicts with culturally rooted expectations of modesty, privacy, and honor. The study proposes design implications for inclusive, context-sensitive parental controls that align with local cultural norms and values. Why it matters: This research is crucial for developing AI tools and policies that are culturally appropriate and safeguard youth in non-Western contexts like the Middle East.

Dual Pose-Graph Semantic Localization for Vision-Based Autonomous Drone Racing

arXiv · Apr 16 · Robotics CV

This work presents a dual pose-graph architecture for robust real-time localization in autonomous drone racing. The system fuses monocular visual-inertial odometry with semantic gate detections, using a temporary graph to optimize multiple observations into refined constraints before promoting them to a persistent main graph. Evaluated on the TII-RATM dataset and deployed in the A2RL competition, it achieved a 56-74% reduction in Absolute Trajectory Error (ATE) compared to standalone VIO and reduced odometry drift by up to 4.2 meters per lap. Why it matters: This research significantly improves the reliability and accuracy of vision-based localization for high-speed autonomous drones, crucial for advanced robotics applications and competitive racing.
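The relative ATE reduction reported above can be illustrated as follows. This is a sketch under stated assumptions: the trajectories are toy values, they are assumed to be time-synchronized and already aligned to the reference frame, and `ate_rmse` is a hypothetical helper rather than the authors' evaluation tool.

```python
import math


def ate_rmse(estimated, reference):
    """Root-mean-square position error between two pre-aligned trajectories."""
    squared = [
        sum((e - r) ** 2 for e, r in zip(est, ref))
        for est, ref in zip(estimated, reference)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Toy 3D positions (metres); the values are illustrative only.
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
vio_only = [(0.0, 0.0, 1.0), (1.0, 0.4, 1.0), (2.0, 0.8, 1.0)]
fused = [(0.0, 0.0, 1.0), (1.0, 0.1, 1.0), (2.0, 0.2, 1.0)]

reduction = 1 - ate_rmse(fused, reference) / ate_rmse(vio_only, reference)
print(f"ATE reduction: {reduction:.0%}")
```

With these toy numbers the fused estimate drifts a quarter as far as the VIO-only one, so the reduction is 75%.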

RightNow-Arabic-0.5B-Turbo: An Open Sub-1B Arabic Language Model via Vocabulary Injection and Edge-First Deployment

arXiv · Apr 10 · LLM Arabic AI

RightNow-Arabic-0.5B-Turbo is a new 518M-parameter Arabic-specialized decoder LLM, built on Qwen2.5-0.5B, designed to bridge the gap between small multilingual and large Arabic-specialized models. Its development pipeline included adding 27,032 Arabic tokens via vocabulary injection, continued pretraining on 504M Arabic tokens, and fine-tuning with supervised instruction and direct preference optimization. The model achieved a 35.9% mean accuracy on three Arabic benchmarks (COPA-ar, Arabic HellaSwag, ArabicMMLU), outperforming all same-class open models and recovering 67% of SILMA-9B's mean accuracy at 1/18 the parameters, with all code and weights publicly released. Why it matters: This model significantly advances efficient Arabic NLP by providing a powerful, specialized sub-1B LLM suitable for edge deployment, making advanced Arabic AI more accessible and performant on resource-constrained devices.

State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation

arXiv · Apr 7 · NLP LLM

Arabic-DeepSeek-R1 is an application-driven, open-source Arabic Large Language Model (LLM) that has achieved a new state-of-the-art (SOTA) across the Open Arabic LLM Leaderboard (OALL). The model utilizes a sparse Mixture-of-Experts (MoE) backbone and a four-phase Chain-of-Thought (CoT) distillation scheme, which incorporates Arabic-specific linguistic verification and regional ethical norms. It records the highest average score on the OALL suite and outperforms proprietary frontier systems like GPT-5.1 on a majority of benchmarks evaluating comprehensive Arabic language-specific tasks. Why it matters: This work offers a validated and cost-effective framework for developing high-performing, culturally-grounded AI for under-represented languages, addressing the digital equity gap.

Severity-Aware Weighted Loss for Arabic Medical Text Generation

arXiv · Apr 7 · NLP LLM

Researchers proposed a severity-aware weighted loss method to fine-tune Arabic language models for medical text generation, prioritizing severe clinical cases. This approach utilizes soft severity probabilities, derived from an AraBERT-based classifier, to dynamically scale token-level loss contributions during optimization on the MAQA dataset. The method consistently improved performance across ten Arabic LLMs, with AraGPT2-Base increasing from 54.04% to 66.14% and AraGPT2-Medium from 59.16% to 67.18%. Why it matters: This novel fine-tuning strategy addresses a critical limitation in medical AI by enhancing the safety and reliability of Arabic medical large language models, particularly in high-stakes clinical scenarios.
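One plausible reading of scaling token-level loss by soft severity probabilities is sketched below. The weighting rule (an expected per-class weight under the classifier's distribution) and the three-class weights are assumptions made for illustration. The summary does not specify the paper's exact formula.

```python
import math


def severity_weighted_loss(token_probs, severity_probs, severity_weights):
    """Mean token negative log-likelihood scaled by an expected severity weight.

    token_probs: model probability assigned to each gold token.
    severity_probs: soft class probabilities from a severity classifier.
    severity_weights: hypothetical per-class weights (assumed, not from the paper).
    """
    scale = sum(p * w for p, w in zip(severity_probs, severity_weights))
    nll = [-math.log(p) for p in token_probs]
    return scale * sum(nll) / len(nll)


weights = [1.0, 2.0, 3.0]          # mild, moderate, severe (assumed)
token_probs = [0.5, 0.25]          # same model predictions for both cases
mild = severity_weighted_loss(token_probs, [0.9, 0.1, 0.0], weights)
severe = severity_weighted_loss(token_probs, [0.0, 0.1, 0.9], weights)
print(mild < severe)
```

For identical predictions, a sample the classifier considers severe contributes more to the loss than a mild one, so optimization pays more attention to it.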

Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation

arXiv · Apr 6 · NLP Research

Researchers have developed OmniScore, a family of deterministic learned metrics designed to evaluate generative text as an alternative to Large Language Models (LLMs) used as judges. OmniScore leverages small models (under 1B parameters) and was trained on approximately 564,000 synthetic instances across 107 languages, then evaluated using 8,617 manually annotated instances. It approximates LLM-judge behavior while offering low latency and consistency across evaluation settings such as reference-based and source-grounded assessment, in tasks including QA, translation, and summarization. Why it matters: This development provides a practical, scalable, and reproducible method for multilingual generative text evaluation, addressing key limitations of LLM-as-a-judge approaches and offering significant benefits for AI development in linguistically diverse regions.

Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation

arXiv · Apr 3 · LLM NLP

QIMMA is introduced as a quality-assured Arabic LLM leaderboard that places systematic benchmark validation at its core. It employs a multi-model assessment pipeline combining automated LLM judgment with human review to identify and resolve quality issues in established Arabic benchmarks. The resulting evaluation suite comprises over 52,000 samples, predominantly grounded in native Arabic content, with transparent implementation via LightEval and EvalPlus. Why it matters: This initiative provides a more reliable and reproducible foundation for evaluating Arabic Large Language Models, addressing critical quality concerns in existing benchmarks.

World Reasoning Arena

arXiv · Mar 26 · Research LLM

Researchers from MBZUAI have introduced WR-Arena, a new comprehensive benchmark designed to evaluate World Models (WMs) beyond traditional next-state prediction and visual fidelity. WR-Arena assesses WMs across three core dimensions: Action Simulation Fidelity, Long-horizon Forecast, and Simulative Reasoning and Planning, using a curated task taxonomy and diverse datasets. Extensive experiments with state-of-the-art WMs revealed a significant gap between current models' capabilities and human-level hypothetical reasoning. Why it matters: This benchmark provides a critical diagnostic tool and guideline for developing more robust and intelligent world models capable of advanced understanding, forecasting, and purposeful action, particularly for AI research in the region.

Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

arXiv · Mar 25 · NLP LLM

Researchers developed a retrieval-augmented generation (RAG) framework to improve Arabic Large Language Models (LLMs) in understanding complex historical and religious texts like the Quran and Hadith. This framework grounds LLMs in the Doha Historical Dictionary of Arabic (DHDA) through hybrid retrieval and intent-based routing. The approach significantly boosted the accuracy of Arabic-native LLMs such as Fanar and ALLaM to over 85%, closing the performance gap with proprietary models like Gemini. Why it matters: This research offers a novel method for enhancing Arabic NLP capabilities for historically nuanced texts, demonstrating the value of integrating diachronic lexicographic resources into RAG systems for deeper language understanding.

CoVR-R: Reason-Aware Composed Video Retrieval

arXiv · Mar 20 · CV RL

A new approach to composed video retrieval (CoVR) is presented, which leverages large multimodal models to infer causal and temporal consequences implied by an edit. The method aligns reasoned queries to candidate videos without task-specific finetuning. A new benchmark, CoVR-Reason, is introduced to evaluate reasoning in CoVR.

Fanar 2.0: Arabic Generative AI Stack

arXiv · Mar 17 · LLM Arabic AI

Hamad Bin Khalifa University (HBKU) has released Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform, built entirely at QCRI. The core of Fanar 2.0 is Fanar-27B, which was continually pre-trained from a Gemma-3-27B backbone using 120 billion high-quality tokens and only 256 NVIDIA H100 GPUs. Fanar 2.0 includes capabilities like FanarGuard, Aura, Oryx, Fanar-Sadiq, Fanar-Diwan, and FanarShaheen for moderation, speech recognition, vision understanding, Islamic content, poetry generation, and translation. Why it matters: This shows that sovereign, resource-constrained AI development in the Arabic language is possible, producing competitive systems in the region.

SectEval: Evaluating the Latent Sectarian Preferences of Large Language Models

arXiv · Mar 13 · NLP LLM

The paper introduces SectEval, a new benchmark to evaluate sectarian biases in LLMs concerning Sunni and Shia Islam, available in English and Hindi. Results show significant inconsistencies in LLM responses based on language, with some models favoring Shia responses in English but Sunni in Hindi. Location-based experiments further reveal that advanced models adapt their responses based on the user's claimed country, while smaller models exhibit a consistent Sunni-leaning bias.

Reinforcement learning-based dynamic cleaning scheduling framework for solar energy system

arXiv · Mar 8 · RL Robotics

This study introduces a reinforcement learning (RL) framework using Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) to optimize the cleaning schedules of photovoltaic panels in arid regions. Applied to a case study in Abu Dhabi, the PPO-based framework demonstrated up to 13% cost savings compared to simulation optimization methods by dynamically adjusting cleaning intervals based on environmental conditions. The research highlights the potential of RL in enhancing the efficiency and reducing the operational costs of solar power generation.

Robust Tightly-Coupled Filter-Based Monocular Visual-Inertial State Estimation and Graph-Based Evaluation for Autonomous Drone Racing

arXiv · Mar 3 · Robotics Research

This paper introduces ADR-VINS, a monocular visual-inertial state estimation framework based on an Error-State Kalman Filter (ESKF) designed for autonomous drone racing, integrating direct pixel reprojection errors from gate corners as innovation terms. It also introduces ADR-FGO, an offline Factor-Graph Optimization framework for generating high-fidelity reference trajectories for post-flight evaluation in GNSS-denied environments. Validated on the TII-RATM dataset, ADR-VINS achieved an average RMS translation error of 0.134 m and was successfully deployed in the A2RL Drone Championship Season 2. Why it matters: The framework provides a robust and efficient solution for drone state estimation in challenging racing environments, and enables performance evaluation without relying on external localization systems.

Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation

arXiv · Mar 2 · NLP LLM

MBZUAI researchers have developed an automatic interview system that uses LLMs to elicit nuanced, role-specific information from job candidates, improving early-stage hiring decisions. The system updates its belief about an applicant's rubric-oriented latent traits in a calibrated way based on their interview performance. Evaluation on simulated interviews showed the system's belief converges towards the simulated applicants' constructed ability levels.

ILION: Deterministic Pre-Execution Safety Gates for Agentic AI Systems

arXiv · Feb 22 · RL Ethics

The paper introduces ILION, a deterministic execution gate designed to ensure the safety of autonomous AI agents by classifying proposed actions as either BLOCK or ALLOW. ILION uses a five-component cascade architecture that operates without statistical training, API dependencies, or labeled data. Evaluation against existing text-safety infrastructures demonstrates ILION's superior performance in preventing unauthorized actions, achieving an F1 score of 0.8515 with sub-millisecond latency.
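The general idea of a deterministic, training-free gate that returns BLOCK or ALLOW can be sketched as a cascade of rule checks. The two checks, the `Action` type, and the allow and deny lists below are hypothetical. The summary does not describe ILION's five components, and this sketch does not attempt to reproduce them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A proposed agent action (hypothetical schema)."""
    tool: str
    target: str


# Assumed policy data, for illustration only.
ALLOWED_TOOLS = {"read_file", "list_dir"}
PROTECTED_PREFIXES = ("/etc/", "/root/")


def tool_check(action):
    """Pass only tools on the allow list."""
    return action.tool in ALLOWED_TOOLS


def target_check(action):
    """Pass only targets outside protected paths."""
    return not action.target.startswith(PROTECTED_PREFIXES)


CASCADE = (tool_check, target_check)


def gate(action):
    """Run checks in order; the first failure blocks the action."""
    for check in CASCADE:
        if not check(action):
            return "BLOCK"
    return "ALLOW"


print(gate(Action("read_file", "/home/user/notes.txt")))
print(gate(Action("read_file", "/etc/passwd")))
```

Because every check is a pure function of the proposed action, the same input always yields the same verdict, which is the property the word "deterministic" refers to here.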

ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models

arXiv · Feb 21 · NLP LLM

The paper introduces ArabicNumBench, a benchmark for evaluating LLMs on Arabic number reading using both Eastern and Western Arabic numerals. It evaluates 71 models from 10 providers on 210 number reading tasks, using zero-shot, zero-shot CoT, few-shot, and few-shot CoT prompting strategies. The results show substantial performance variation, with few-shot CoT prompting achieving 2.8x higher accuracy than zero-shot approaches. Why it matters: The benchmark establishes baselines for Arabic number comprehension and provides guidance for model selection in production Arabic NLP systems.

ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning

arXiv · Feb 19 · NLP Arabic AI

The paper introduces ALPS (Arabic Linguistic & Pragmatic Suite), a diagnostic challenge set for evaluating deep semantics and pragmatics in Arabic NLP. The dataset contains 531 expert-curated questions across 15 tasks and 47 subtasks, designed to test morpho-syntactic dependencies and compositional semantics. Evaluation of 23 models, including commercial, open-source, and Arabic-native models, reveals that models struggle with fundamental morpho-syntactic dependencies, especially those reliant on diacritics. Why it matters: ALPS provides a valuable benchmark for evaluating the linguistic competence of Arabic NLP models, highlighting areas where current models fall short despite achieving high fluency.

From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models

arXiv · Feb 10 · NLP LLM

Arabic Language Models (LMs) are primarily pretrained on Modern Standard Arabic (MSA), with an expectation of transferring to diverse Arabic dialects for real-world applications. This work explores cross-lingual transfer in Arabic LMs using probing on three Natural Language Processing (NLP) tasks and representational similarity. The findings indicate that transfer is possible but disproportionate across dialects, with some evidence of negative interference in models trained to support all Arabic dialects. Why it matters: This research highlights crucial challenges for building robust Arabic AI systems that effectively handle the significant linguistic diversity of the Arab world.

SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models

arXiv · Feb 3 · NLP LLM

The paper introduces SalamahBench, a new benchmark for evaluating the safety of Arabic Language Models (ALMs). The benchmark comprises 8,170 prompts across 12 categories aligned with the MLCommons Safety Hazard Taxonomy. Five state-of-the-art ALMs, including Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2, were evaluated using the benchmark. Why it matters: The benchmark enables standardized, category-aware safety evaluation, highlighting the necessity of specialized safeguard mechanisms for robust harm mitigation in ALMs.

DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding

arXiv · Jan 27 · CV NLP

MBZUAI researchers introduce DuwatBench, a new benchmark for multimodal understanding of Arabic calligraphy. The dataset contains 1,272 samples across six calligraphic styles with detailed annotations to evaluate visual-text alignment. Evaluation of 13 multimodal models reveals challenges in processing calligraphic variations and artistic distortions, highlighting the need for culturally grounded AI research.

Generative AI in Saudi Arabia: A National Survey of Adoption, Risks, and Public Perceptions

arXiv · Jan 26 · Research Policy

A national survey of 330 participants in Saudi Arabia finds that 93% actively use Generative AI, primarily for text-based tasks, while awareness and understanding remain uneven. Participants recognize benefits such as productivity but raise concerns about privacy, misinformation, and ethical misuse. The study highlights the need for AI literacy, culturally aligned solutions, and stronger frameworks for responsible deployment in Saudi Arabia.

MonoRace: Winning Champion-Level Drone Racing with Robust Monocular AI

arXiv · Jan 21 · Robotics RL

The paper presents MonoRace, an onboard drone racing approach using a monocular camera and IMU. The system combines neural-network-based gate segmentation with a drone model for robust state estimation, along with offline optimization using gate geometry. MonoRace won the 2025 Abu Dhabi Autonomous Drone Racing Competition (A2RL), outperforming AI teams and human world champions, reaching speeds up to 100 km/h. Why it matters: This demonstrates a significant advancement in autonomous drone racing, achieving champion-level performance with a resource-efficient monocular system, validated in a real-world competition setting in the UAE.

Hybrid Deep Feature Extraction and ML for Construction and Demolition Debris Classification

arXiv · Jan 20 · CV Research

This paper introduces a hybrid deep learning and machine learning pipeline for classifying construction and demolition waste. A dataset of 1,800 images from UAE construction sites was created, and deep features were extracted using a pre-trained Xception network. The combination of Xception features with machine learning classifiers achieved up to 99.5% accuracy, demonstrating state-of-the-art performance for debris identification.

YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

arXiv · Jan 13 · LLM RL

The paper introduces Yet another Policy Optimization (YaPO), a reference-free method for learning sparse steering vectors in the latent space of a Sparse Autoencoder (SAE) to steer LLMs. By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Experiments show that, compared to dense steering baselines, YaPO converges faster, achieves stronger performance, trains more stably, and better preserves general knowledge.

Community-Based Early-Stage Chronic Kidney Disease Screening using Explainable Machine Learning for Low-Resource Settings

arXiv · Jan 3 · Research Healthcare

This paper introduces an explainable machine learning framework for early-stage chronic kidney disease (CKD) screening, specifically designed for low-resource settings in Bangladesh and South Asia. The framework utilizes a community-based dataset from Bangladesh and evaluates multiple ML classifiers with feature selection techniques. Results show that the ML models achieve high accuracy and sensitivity, outperforming existing screening tools and demonstrating strong generalizability across independent datasets from India, the UAE, and Bangladesh.

Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation

arXiv · Dec 25 · NLP Arabic AI

The paper introduces Ara-HOPE, a human-centric post-editing evaluation framework for Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation. Ara-HOPE includes a five-category error taxonomy and a decision-tree annotation protocol designed to address the challenges of dialect-specific MT errors. Evaluation of Jais, GPT-3.5, and NLLB-200 shows dialect-specific terminology and semantic preservation remain key challenges. Why it matters: The new framework and public dataset will help improve the evaluation and development of dialect-aware MT systems for Arabic.

Drift-Corrected Monocular VIO and Perception-Aware Planning for Autonomous Drone Racing

arXiv · Dec 23 · Robotics RL

This paper details the autonomous drone racing system developed for the Abu Dhabi Autonomous Racing League (A2RL) x Drone Champions League competition. The system uses drift-corrected monocular Visual-Inertial Odometry (VIO) fused with YOLO-based gate detection for global position measurements, managed via a Kalman filter. A perception-aware planner generates trajectories balancing speed and gate visibility. Why it matters: The system's podium finishes validate the effectiveness of monocular vision-based autonomous drone flight and showcase advancements in AI-powered robotics within the UAE.
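As a rough illustration of the fusion idea in this summary, the sketch below runs a scalar Kalman filter in which odometry increments drive the prediction step and an absolute position fix (standing in for a gate detection) corrects accumulated drift. This is a textbook one-dimensional filter under assumed noise values, not the paper's estimator; the function names and all numbers are hypothetical.

```python
from typing import Tuple


def kalman_predict(x: float, p: float, delta: float, q: float) -> Tuple[float, float]:
    """Propagate the position estimate with an odometry increment.

    x, p  : current position estimate and its variance
    delta : displacement reported by odometry (assumed input)
    q     : process noise variance, modelling odometry drift (assumed)
    """
    return x + delta, p + q


def kalman_update(x: float, p: float, z: float, r: float) -> Tuple[float, float]:
    """Correct the estimate with an absolute position measurement.

    z : measured position (e.g. derived from a detected gate, assumed)
    r : measurement noise variance (assumed)
    """
    k = p / (p + r)            # Kalman gain: how much to trust the measurement
    x_new = x + k * (z - x)    # pull the estimate toward the measurement
    p_new = (1.0 - k) * p      # uncertainty shrinks after the correction
    return x_new, p_new


if __name__ == "__main__":
    x, p = 0.0, 1.0
    # Odometry says we moved 0.5 m; variance grows by the process noise.
    x, p = kalman_predict(x, p, delta=0.5, q=0.25)
    # A gate-based fix reports 0.8 m; the filter blends both sources.
    x, p = kalman_update(x, p, z=0.8, r=1.0)
    print(x, p)
```

The prediction step only ever increases the variance, which is why an occasional absolute measurement is needed to keep drift bounded.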

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

arXiv · Dec 22 · CV Research

The paper introduces the Prism Hypothesis, which posits a correspondence between an encoder's feature spectrum and its functional role, with semantic encoders capturing low-frequency components and pixel encoders retaining high-frequency information. Based on this, the authors propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details using a frequency-band modulator. Experiments on ImageNet and MS-COCO demonstrate that UAE effectively unifies semantic abstraction and pixel-level fidelity, achieving state-of-the-art performance.

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv · Dec 20 · NLP Arabic AI

The paper introduces AraToken, an Arabic-optimized tokenizer based on the SentencePiece Unigram algorithm that incorporates a normalization pipeline to handle Arabic-specific orthographic variations. Experiments show that AraToken achieves 18% lower fertility compared to unnormalized baselines. The Language Extension Pipeline (LEP) is introduced to integrate AraToken into Qwen3-0.6B, reducing evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: This research provides an efficient tokenizer tailored for Arabic, improving performance of LLMs on Arabic text and benefiting Arabic NLP research by providing released resources.
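For readers unfamiliar with the metric, tokenizer fertility is commonly computed as the average number of tokens produced per word, so lower is better. The sketch below shows that calculation and the relative reduction between two fertility values. It assumes whitespace word segmentation and uses a hypothetical toy tokenizer in place of AraToken; the example values are illustrative, not results from the paper.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional drop from a baseline value to a new value."""
    return (baseline - new) / baseline


def toy_bigram_tokenizer(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits each word into 2-character chunks."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return tokens


if __name__ == "__main__":
    corpus = ["abcd ef"]  # placeholder text, not real evaluation data
    print(fertility(corpus, toy_bigram_tokenizer))
    # Illustrative fertility values only: a drop from 2.0 to 1.64 is an 18% reduction.
    print(relative_reduction(2.0, 1.64))
```

Any real tokenizer can be plugged in through the `tokenize` argument, as long as it maps a string to a list of tokens.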

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv · Dec 18 · CV NLP

The paper introduces LongShOTBench, a benchmark for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. It addresses the limitations of existing datasets by combining temporal length with multimodal richness, using human-validated samples. The authors also present LongShOTAgent, an agentic system for analyzing long videos. Both the benchmark and the agent highlight the challenges that state-of-the-art MLLMs face on long videos.

From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region

arXiv · Dec 16 · CV Research

A new study compares vision-language models (VLMs) to YOLOv8 for wastewater treatment plant (WWTP) identification in satellite imagery across the MENA region. VLMs like Gemma-3 demonstrate superior zero-shot performance compared to a YOLOv8 model trained on a dataset of 83,566 satellite images from Egypt, Saudi Arabia, and the UAE. The research suggests VLMs offer a scalable, annotation-free alternative for remote sensing of WWTPs.

OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving

arXiv · Dec 16 · CV RL

The paper introduces OmniGen, a unified framework for generating aligned multimodal sensor data for autonomous driving using a shared Bird's Eye View (BEV) space. It uses a novel generalizable multimodal reconstruction method (UAE) to jointly decode LiDAR and multi-view camera data through volume rendering. The framework incorporates a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation, demonstrating good performance and multimodal consistency.

Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models

arXiv · Nov 28 · CV RL

Researchers at MBZUAI have introduced Video-R2, a reinforcement learning approach to improve the consistency and visual grounding of reasoning in multimodal language models. Video-R2 combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a Temporal Alignment Reward (TAR). The model demonstrates higher Think Answer Consistency (TAC), Video Attention Score (VAS), and accuracy across multiple benchmarks, showing improved temporal alignment and reasoning coherence for video understanding.
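The "group relative" part of GRPO can be sketched briefly: several responses are sampled for the same prompt, each gets a scalar reward, and each response's advantage is its reward standardized against the group's mean and standard deviation. The snippet below is a minimal sketch of that step only. The rewards are placeholder numbers rather than outputs of the Temporal Alignment Reward, and the use of the population standard deviation with a small epsilon is an assumption, since implementations differ.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of sampled responses.

    rewards : one scalar reward per sampled response to the same prompt
    eps     : small constant to avoid division by zero (assumed value)
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Placeholder rewards for four sampled responses to one prompt.
    print(group_relative_advantages([0.2, 0.9, 0.4, 0.5]))
```

Responses that score above the group average receive positive advantages and are reinforced; when all responses score the same, every advantage is zero and the group contributes no learning signal.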
