GCC AI Research

arXiv

arXiv preprint server — articles filtered for GCC-affiliated authors in AI, Machine Learning, NLP, Computer Vision, Information Retrieval, Robotics, and Statistics. Covers research from MBZUAI, KAUST, TII, Khalifa University, QCRI, and collaborating institutions.

https://arxiv.org →

101–150 articles · Page 3

The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias

arXiv · · LLM Research

This study introduces a Probabilistic Graphical Model (PGM) framework utilizing Pearl's do-operator to causally audit LLM safety mechanisms, specifically isolating the effect of injecting cultural demographics into prompts. A large-scale empirical analysis was conducted across seven instruction-tuned models from diverse origins, including the UAE's Falcon3-7B, as well as models from the US, Europe, China, and India, using ToxiGen and BOLD datasets. The findings revealed a disparity between observational and interventional bias, demonstrating that standard fairness metrics can overestimate demographic bias. Western models exhibited higher causal refusal rates for specific demographic groups, while Eastern models showed low overall intervention rates with targeted sensitivities toward regional demographics. Why it matters: This research highlights the geopolitical nuances of LLM safety alignment and the potential for demographic-sensitive over-triggering to restrict benign discourse, which is particularly relevant for diverse regions like the Middle East in developing culturally-aware AI.
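The gap between observational and interventional bias described above can be illustrated with a toy back-door adjustment. This is a minimal sketch, not the paper's PGM: it assumes a single hypothetical confounder (whether the prompt topic is sensitive), and every probability in the tables is invented for illustration.

```python
# Toy illustration: observational vs. interventional refusal-rate gap.
# All numbers are invented for illustration; they are not from the paper.

# Z = 1 means the prompt topic is sensitive (hypothetical confounder).
p_z = {0: 0.7, 1: 0.3}

# P(X=1 | Z): demographic mentions are more common in sensitive prompts.
p_x1_given_z = {0: 0.2, 1: 0.8}

# P(Y=1 | X, Z): refusal probability, keyed by (x, z).
p_refuse = {(0, 0): 0.05, (1, 0): 0.07, (0, 1): 0.50, (1, 1): 0.55}


def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary X."""
    return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]


def observational(x):
    """P(Y=1 | X=x), obtained by conditioning on X and marginalizing Z."""
    joint = sum(p_refuse[(x, z)] * p_x_given_z(x, z) * p_z[z] for z in p_z)
    marginal = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    return joint / marginal


def interventional(x):
    """P(Y=1 | do(X=x)) via the back-door adjustment over Z."""
    return sum(p_refuse[(x, z)] * p_z[z] for z in p_z)


obs_gap = observational(1) - observational(0)
do_gap = interventional(1) - interventional(0)
print(f"observational gap:  {obs_gap:.3f}")
print(f"interventional gap: {do_gap:.3f}")
```

In this toy setup the observational gap is much larger than the interventional one, because sensitive topics drive both the demographic mention and the refusal. That is the kind of overestimate a purely correlational fairness metric can produce.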

Climate-based Pre-screening of Self-sustaining Regreening Opportunities in Drylands: A Case Study for Saudi Arabia

arXiv · · Research ML

Researchers have developed a scalable pre-screening framework that integrates climate and remote sensing data to identify cost-efficient sites for sustainable dryland restoration, using Saudi Arabia as a case study. The framework employs machine learning models to derive a Climate Suitability Score (CSS), which captures climatic dependencies on vegetation persistence. National-scale prediction maps were generated using multi-year ERA5-Land data for Saudi Arabia, leading to the identification of thirteen priority locations with an estimated potential for a 2.5-fold increase in vegetation coverage. Why it matters: This approach significantly reduces the search space and costs associated with restoration efforts, supporting more resilient and sustainable ecosystem recovery planning in water-limited regions of the Middle East.

The Cylindrical Representation Hypothesis for Language Model Steering

arXiv · · LLM NLP

Researchers from MBZUAI have proposed the Cylindrical Representation Hypothesis (CRH) to explain the instability and unpredictability observed in large language model steering. CRH relaxes the orthogonality assumption of the existing Linear Representation Hypothesis, positing a cylindrical structure where a central axis captures concept differences and a surrounding normal plane controls steering sensitivity. The hypothesis suggests that the intrinsic uncertainty in identifying specific sensitive sectors within this normal plane accounts for why steering outcomes frequently fluctuate even with well-aligned directions. Why it matters: This research offers a more robust theoretical framework for understanding and potentially improving the control and reliability of large language models.

Governing What the EU AI Act Excludes: Accountability for Autonomous AI Agents in Smart City Critical Infrastructure

arXiv · · Policy Ethics

This research paper identifies an accountability deficit for autonomous AI agents operating in smart city critical infrastructure under the EU AI Act, noting that specific provisions exclude safety-component AI from certain explanation rights and impact assessments. It proposes AgentGov-SC, a three-layer governance architecture specifying 25 measures, 5 conflict resolution rules, and an autonomy-calibrated activation model, with bidirectional traceability to established AI frameworks. A scenario analysis traces the governance activation through a multi-agent corridor cascade involving documented UAE smart-city systems. Why it matters: This paper addresses a significant regulatory gap in AI governance for complex, multi-agent systems in critical urban infrastructure, offering a novel architectural solution highly relevant to global smart city initiatives, including those in the Middle East.

Instruction-Guided Poetry Generation in Arabic and Its Dialects

arXiv · · NLP LLM

Researchers at MBZUAI have developed a new method for controllable poetry generation in Arabic and its dialects, moving beyond traditional analysis tasks for Arabic poetry within Large Language Models (LLMs). They introduce a large-scale, instruction-based dataset in Modern Standard Arabic (MSA) and various Arabic dialects, enabling LLMs to perform tasks like writing, revising, and continuing poems based on user criteria. Experiments show that fine-tuning LLMs on this dataset results in models capable of generating poetry aligned with user requirements, validated by automated metrics and human evaluation. Why it matters: This work represents a significant advancement in Arabic Natural Language Processing, offering tools for creative expression and cultural preservation while opening new avenues for user-guided content generation in culturally rich text forms.

Culturally Aware GenAI Risks for Youth: Perspectives from Youth, Parents, and Teachers in a Non-Western Context

arXiv · · Research Ethics

A study investigated the culturally aware risks of Generative AI for youth aged 7-17 in Saudi Arabia, focusing on privacy and safety challenges. Researchers analyzed 736 Reddit posts and 1,262 X (Twitter) posts and interviewed 31 Saudi participants, including youth, parents, and teachers. Findings highlighted context-dependent risks, particularly regarding the disclosure of personal and family information that conflicts with culturally rooted expectations of modesty, privacy, and honor. The study proposes design implications for inclusive, context-sensitive parental controls that align with local cultural norms and values. Why it matters: This research is crucial for developing AI tools and policies that are culturally appropriate and safeguard youth in non-Western contexts like the Middle East.

Dual Pose-Graph Semantic Localization for Vision-Based Autonomous Drone Racing

arXiv · · Robotics CV

This work presents a dual pose-graph architecture for robust real-time localization in autonomous drone racing. The system fuses monocular visual-inertial odometry with semantic gate detections, using a temporary graph to optimize multiple observations into refined constraints before promoting them to a persistent main graph. Evaluated on the TII-RATM dataset and deployed in the A2RL competition, it achieved a 56-74% reduction in Absolute Trajectory Error (ATE) compared to standalone VIO and reduced odometry drift by up to 4.2 meters per lap. Why it matters: This research significantly improves the reliability and accuracy of vision-based localization for high-speed autonomous drones, crucial for advanced robotics applications and competitive racing.

State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation

arXiv · · NLP LLM

Arabic-DeepSeek-R1 is an application-driven, open-source Arabic Large Language Model (LLM) that has achieved a new state-of-the-art (SOTA) across the Open Arabic LLM Leaderboard (OALL). The model utilizes a sparse Mixture-of-Experts (MoE) backbone and a four-phase Chain-of-Thought (CoT) distillation scheme, which incorporates Arabic-specific linguistic verification and regional ethical norms. It records the highest average score on the OALL suite and outperforms proprietary frontier systems like GPT-5.1 on a majority of benchmarks evaluating comprehensive Arabic language-specific tasks. Why it matters: This work offers a validated and cost-effective framework for developing high-performing, culturally-grounded AI for under-represented languages, addressing the digital equity gap.

Severity-Aware Weighted Loss for Arabic Medical Text Generation

arXiv · · NLP LLM

Researchers proposed a severity-aware weighted loss method to fine-tune Arabic language models for medical text generation, prioritizing severe clinical cases. This approach utilizes soft severity probabilities, derived from an AraBERT-based classifier, to dynamically scale token-level loss contributions during optimization on the MAQA dataset. The method consistently improved performance across ten Arabic LLMs, with AraGPT2-Base increasing from 54.04% to 66.14% and AraGPT2-Medium from 59.16% to 67.18%. Why it matters: This novel fine-tuning strategy addresses a critical limitation in medical AI by enhancing the safety and reliability of Arabic medical large language models, particularly in high-stakes clinical scenarios.
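A minimal sketch of how soft severity probabilities could scale token-level loss, under stated assumptions: the three severity classes, the `CLASS_WEIGHTS` values, the `alpha` scale, and the `1 + alpha * expected_severity` mapping are all hypothetical choices made for illustration, and the paper's actual weighting function may differ.

```python
import math

# Hypothetical per-class weights for (mild, moderate, severe); not from the paper.
CLASS_WEIGHTS = (0.0, 0.5, 1.0)


def severity_weight(severity_probs, alpha=1.0):
    """Turn soft severity probabilities into one scalar loss weight (>= 1)."""
    expected = sum(p * w for p, w in zip(severity_probs, CLASS_WEIGHTS))
    return 1.0 + alpha * expected


def weighted_loss(batch, alpha=1.0):
    """Mean token-level negative log-likelihood, scaled per example by severity."""
    total, n_tokens = 0.0, 0
    for example in batch:
        weight = severity_weight(example["severity_probs"], alpha)
        for p in example["target_token_probs"]:
            total += weight * -math.log(p)
            n_tokens += 1
    return total / n_tokens


# Two made-up examples: one mostly mild, one mostly severe.
batch = [
    {"severity_probs": (0.80, 0.15, 0.05), "target_token_probs": [0.9, 0.7, 0.8]},
    {"severity_probs": (0.05, 0.15, 0.80), "target_token_probs": [0.6, 0.5, 0.7]},
]
print("unweighted:", weighted_loss(batch, alpha=0.0))
print("severity-weighted:", weighted_loss(batch, alpha=1.0))
```

With `alpha=0.0` the function reduces to the ordinary mean token loss, which makes it easy to compare against a standard fine-tuning baseline.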

Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation

arXiv · · NLP Research

Researchers have developed OmniScore, a family of deterministic learned metrics for evaluating generative text, intended as an alternative to using Large Language Models (LLMs) as judges. OmniScore relies on small models (under 1B parameters) and was trained on approximately 564,000 synthetic instances across 107 languages, then evaluated using 8,617 manually annotated instances. It approximates LLM-judge behavior while offering low latency and consistency across evaluation settings such as reference-based and source-grounded assessment, in tasks like QA, translation, and summarization. Why it matters: This development provides a practical, scalable, and reproducible method for multilingual generative text evaluation, addressing key limitations of LLM-as-a-judge approaches and offering significant benefits for AI development in linguistically diverse regions.

Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation

arXiv · · LLM NLP

QIMMA is introduced as a quality-assured Arabic LLM leaderboard that places systematic benchmark validation at its core. It employs a multi-model assessment pipeline combining automated LLM judgment with human review to identify and resolve quality issues in established Arabic benchmarks. The resulting evaluation suite comprises over 52,000 samples, predominantly grounded in native Arabic content, with transparent implementation via LightEval and EvalPlus. Why it matters: This initiative provides a more reliable and reproducible foundation for evaluating Arabic Large Language Models, addressing critical quality concerns in existing benchmarks.

World Reasoning Arena

arXiv · · Research LLM

Researchers from MBZUAI have introduced WR-Arena, a new comprehensive benchmark designed to evaluate World Models (WMs) beyond traditional next-state prediction and visual fidelity. WR-Arena assesses WMs across three core dimensions: Action Simulation Fidelity, Long-horizon Forecast, and Simulative Reasoning and Planning, using a curated task taxonomy and diverse datasets. Extensive experiments with state-of-the-art WMs revealed a significant gap between current models' capabilities and human-level hypothetical reasoning. Why it matters: This benchmark provides a critical diagnostic tool and guideline for developing more robust and intelligent world models capable of advanced understanding, forecasting, and purposeful action, particularly for AI research in the region.

Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

arXiv · · NLP LLM

Researchers developed a retrieval-augmented generation (RAG) framework to improve Arabic Large Language Models (LLMs) in understanding complex historical and religious texts like the Quran and Hadith. This framework grounds LLMs in the Doha Historical Dictionary of Arabic (DHDA) through hybrid retrieval and intent-based routing. The approach significantly boosted the accuracy of Arabic-native LLMs such as Fanar and ALLaM to over 85%, closing the performance gap with proprietary models like Gemini. Why it matters: This research offers a novel method for enhancing Arabic NLP capabilities for historically nuanced texts, demonstrating the value of integrating diachronic lexicographic resources into RAG systems for deeper language understanding.

CoVR-R: Reason-Aware Composed Video Retrieval

arXiv · · CV RL

A new approach to composed video retrieval (CoVR) is presented, which leverages large multimodal models to infer causal and temporal consequences implied by an edit. The method aligns reasoned queries to candidate videos without task-specific finetuning. A new benchmark, CoVR-Reason, is introduced to evaluate reasoning in CoVR.

Fanar 2.0: Arabic Generative AI Stack

arXiv · · LLM Arabic AI

Hamad Bin Khalifa University (HBKU) has released Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform, built entirely at QCRI. The core of Fanar 2.0 is Fanar-27B, which was continually pre-trained from a Gemma-3-27B backbone using 120 billion high-quality tokens and only 256 NVIDIA H100 GPUs. Fanar 2.0 includes capabilities like FanarGuard, Aura, Oryx, Fanar-Sadiq, Fanar-Diwan, and FanarShaheen for moderation, speech recognition, vision understanding, Islamic content, poetry generation, and translation. Why it matters: This shows that sovereign, resource-constrained AI development in the Arabic language is possible, producing competitive systems in the region.

SectEval: Evaluating the Latent Sectarian Preferences of Large Language Models

arXiv · · NLP LLM

The paper introduces SectEval, a new benchmark to evaluate sectarian biases in LLMs concerning Sunni and Shia Islam, available in English and Hindi. Results show significant inconsistencies in LLM responses based on language, with some models favoring Shia responses in English but Sunni in Hindi. Location-based experiments further reveal that advanced models adapt their responses based on the user's claimed country, while smaller models exhibit a consistent Sunni-leaning bias.

Reinforcement learning-based dynamic cleaning scheduling framework for solar energy system

arXiv · · RL Robotics

This study introduces a reinforcement learning (RL) framework using Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) to optimize the cleaning schedules of photovoltaic panels in arid regions. Applied to a case study in Abu Dhabi, the PPO-based framework demonstrated up to 13% cost savings compared to simulation optimization methods by dynamically adjusting cleaning intervals based on environmental conditions. The research highlights the potential of RL in enhancing the efficiency and reducing the operational costs of solar power generation.

Robust Tightly-Coupled Filter-Based Monocular Visual-Inertial State Estimation and Graph-Based Evaluation for Autonomous Drone Racing

arXiv · · Robotics Research

This paper introduces ADR-VINS, a monocular visual-inertial state estimation framework based on an Error-State Kalman Filter (ESKF) designed for autonomous drone racing, integrating direct pixel reprojection errors from gate corners as innovation terms. It also introduces ADR-FGO, an offline Factor-Graph Optimization framework for generating high-fidelity reference trajectories for post-flight evaluation in GNSS-denied environments. Validated on the TII-RATM dataset, ADR-VINS achieved an average RMS translation error of 0.134 m and was successfully deployed in the A2RL Drone Championship Season 2. Why it matters: The framework provides a robust and efficient solution for drone state estimation in challenging racing environments, and enables performance evaluation without relying on external localization systems.

Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation

arXiv · · NLP LLM

MBZUAI researchers have developed an automatic interview system that uses LLMs to elicit nuanced, role-specific information from job candidates, improving early-stage hiring decisions. The system updates its belief about an applicant's rubric-oriented latent traits in a calibrated way based on their interview performance. Evaluation on simulated interviews showed the system's belief converges towards the simulated applicants' constructed ability levels.

ILION: Deterministic Pre-Execution Safety Gates for Agentic AI Systems

arXiv · · RL Ethics

The paper introduces ILION, a deterministic execution gate designed to ensure the safety of autonomous AI agents by classifying proposed actions as either BLOCK or ALLOW. ILION uses a five-component cascade architecture that operates without statistical training, API dependencies, or labeled data. Evaluation against existing text-safety infrastructures demonstrates ILION's superior performance in preventing unauthorized actions, achieving an F1 score of 0.8515 with sub-millisecond latency.

ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models

arXiv · · NLP LLM

The paper introduces ArabicNumBench, a benchmark for evaluating LLMs on Arabic number reading using both Eastern and Western Arabic numerals. It evaluates 71 models from 10 providers on 210 number reading tasks, using zero-shot, zero-shot CoT, few-shot, and few-shot CoT prompting strategies. The results show substantial performance variation, with few-shot CoT prompting achieving 2.8x higher accuracy than zero-shot approaches. Why it matters: The benchmark establishes baselines for Arabic number comprehension and provides guidance for model selection in production Arabic NLP systems.

ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning

arXiv · · NLP Arabic AI

The paper introduces ALPS (Arabic Linguistic & Pragmatic Suite), a diagnostic challenge set for evaluating deep semantics and pragmatics in Arabic NLP. The dataset contains 531 expert-curated questions across 15 tasks and 47 subtasks, designed to test morpho-syntactic dependencies and compositional semantics. Evaluation of 23 models, including commercial, open-source, and Arabic-native models, reveals that models struggle with fundamental morpho-syntactic dependencies, especially those reliant on diacritics. Why it matters: ALPS provides a valuable benchmark for evaluating the linguistic competence of Arabic NLP models, highlighting areas where current models fall short despite achieving high fluency.

From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models

arXiv · · NLP LLM

Arabic Language Models (LMs) are primarily pretrained on Modern Standard Arabic (MSA), with an expectation of transferring to diverse Arabic dialects for real-world applications. This work explores cross-lingual transfer in Arabic LMs using probing on three Natural Language Processing (NLP) tasks and representational similarity. The findings indicate that transfer is possible but disproportionate across dialects, with some evidence of negative interference in models trained to support all Arabic dialects. Why it matters: This research highlights crucial challenges for building robust Arabic AI systems that effectively handle the significant linguistic diversity of the Arab world.

SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models

arXiv · · NLP LLM

The paper introduces SalamahBench, a new benchmark for evaluating the safety of Arabic Language Models (ALMs). The benchmark comprises 8,170 prompts across 12 categories aligned with the MLCommons Safety Hazard Taxonomy. Five state-of-the-art ALMs (Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2) were evaluated using the benchmark. Why it matters: The benchmark enables standardized, category-aware safety evaluation, highlighting the necessity of specialized safeguard mechanisms for robust harm mitigation in ALMs.

DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding

arXiv · · CV NLP

MBZUAI researchers introduce DuwatBench, a new benchmark for multimodal understanding of Arabic calligraphy. The dataset contains 1,272 samples across six calligraphic styles with detailed annotations to evaluate visual-text alignment. Evaluation of 13 multimodal models reveals challenges in processing calligraphic variations and artistic distortions, highlighting the need for culturally grounded AI research.

Generative AI in Saudi Arabia: A National Survey of Adoption, Risks, and Public Perceptions

arXiv · · Research Policy

A national survey of 330 participants in Saudi Arabia reveals that 93% actively use Generative AI, primarily for text-based tasks, while awareness and understanding remain uneven. Participants recognize benefits such as productivity but raise concerns about privacy, misinformation, and ethical misuse. The study highlights the need for AI literacy, culturally aligned solutions, and stronger frameworks for responsible deployment in Saudi Arabia.

MonoRace: Winning Champion-Level Drone Racing with Robust Monocular AI

arXiv · · Robotics RL

The paper presents MonoRace, an onboard drone racing approach using a monocular camera and IMU. The system combines neural-network-based gate segmentation with a drone model for robust state estimation, along with offline optimization using gate geometry. MonoRace won the 2025 Abu Dhabi Autonomous Drone Racing Competition (A2RL), outperforming other AI teams and human world champions and reaching speeds of up to 100 km/h. Why it matters: This demonstrates a significant advancement in autonomous drone racing, achieving champion-level performance with a resource-efficient monocular system, validated in a real-world competition setting in the UAE.

Hybrid Deep Feature Extraction and ML for Construction and Demolition Debris Classification

arXiv · · CV Research

This paper introduces a hybrid deep learning and machine learning pipeline for classifying construction and demolition waste. A dataset of 1,800 images from UAE construction sites was created, and deep features were extracted using a pre-trained Xception network. The combination of Xception features with machine learning classifiers achieved up to 99.5% accuracy, demonstrating state-of-the-art performance for debris identification.

YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

arXiv · · LLM RL

The paper introduces Yet another Policy Optimization (YaPO), a reference-free method for learning sparse steering vectors in the latent space of a Sparse Autoencoder (SAE) to steer LLMs. By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Experiments show that, compared to dense steering baselines, YaPO converges faster, achieves stronger performance, exhibits improved training stability, and preserves general knowledge.

Community-Based Early-Stage Chronic Kidney Disease Screening using Explainable Machine Learning for Low-Resource Settings

arXiv · · Research Healthcare

This paper introduces an explainable machine learning framework for early-stage chronic kidney disease (CKD) screening, specifically designed for low-resource settings in Bangladesh and South Asia. The framework utilizes a community-based dataset from Bangladesh and evaluates multiple ML classifiers with feature selection techniques. Results show that the ML models achieve high accuracy and sensitivity, outperforming existing screening tools and demonstrating strong generalizability across independent datasets from India, the UAE, and Bangladesh.

Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation

arXiv · · NLP Arabic AI

The paper introduces Ara-HOPE, a human-centric post-editing evaluation framework for Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation. Ara-HOPE includes a five-category error taxonomy and a decision-tree annotation protocol designed to address the challenges of dialect-specific MT errors. Evaluation of Jais, GPT-3.5, and NLLB-200 shows dialect-specific terminology and semantic preservation remain key challenges. Why it matters: The new framework and public dataset will help improve the evaluation and development of dialect-aware MT systems for Arabic.

Drift-Corrected Monocular VIO and Perception-Aware Planning for Autonomous Drone Racing

arXiv · · Robotics RL

This paper details the autonomous drone racing system developed for the Abu Dhabi Autonomous Racing League (A2RL) x Drone Champions League competition. The system uses drift-corrected monocular Visual-Inertial Odometry (VIO) fused with YOLO-based gate detection for global position measurements, managed via a Kalman filter. A perception-aware planner generates trajectories balancing speed and gate visibility. Why it matters: The system's podium finishes validate the effectiveness of monocular vision-based autonomous drone flight and showcase advancements in AI-powered robotics within the UAE.

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

arXiv · · CV Research

The paper introduces the Prism Hypothesis, which posits a correspondence between an encoder's feature spectrum and its functional role, with semantic encoders capturing low-frequency components and pixel encoders retaining high-frequency information. Based on this, the authors propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details using a frequency-band modulator. Experiments on ImageNet and MS-COCO demonstrate that UAE effectively unifies semantic abstraction and pixel-level fidelity, achieving state-of-the-art performance.

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv · · NLP Arabic AI

The paper introduces AraToken, an Arabic-optimized tokenizer based on the SentencePiece Unigram algorithm that incorporates a normalization pipeline to handle Arabic-specific orthographic variations. Experiments show that AraToken achieves 18% lower fertility compared to unnormalized baselines. The Language Extension Pipeline (LEP) is introduced to integrate AraToken into Qwen3-0.6B, reducing evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: This research provides an efficient tokenizer tailored for Arabic, improving performance of LLMs on Arabic text and benefiting Arabic NLP research by providing released resources.
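Fertility is the average number of tokens produced per word, so lower is better. The sketch below shows how orthographic normalization can lower it. Everything here is an assumption made for illustration: the three normalization rules (dropping diacritics, dropping tatweel, unifying alef variants), the two-word vocabulary, and the greedy longest-match tokenizer are stand-ins, not AraToken's actual pipeline or a SentencePiece Unigram model.

```python
import re

TATWEEL = "\u0640"
DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan through sukun
# Map alef with hamza above/below and alef with madda to bare alef.
ALEF_VARIANTS = str.maketrans(
    {"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"}
)


def normalize(text):
    """Apply three illustrative Arabic normalization rules."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return text.translate(ALEF_VARIANTS)


def tokenize(word, vocab):
    """Greedy longest-match segmentation with single-character fallback."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def fertility(text, vocab):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return sum(len(tokenize(w, vocab)) for w in words) / len(words)


# Toy vocabulary holding the normalized forms of "kitab" and "ahmad".
vocab = {"\u0643\u062A\u0627\u0628", "\u0627\u062D\u0645\u062F"}
# The same two words written with diacritics and a hamza-carrying alef.
raw = "\u0643\u0650\u062A\u064E\u0627\u0628 \u0623\u062D\u0645\u062F"

print("raw fertility:", fertility(raw, vocab))
print("normalized fertility:", fertility(normalize(raw), vocab))
```

In this toy case the unnormalized words fall back to one token per character, while their normalized forms each match a single vocabulary entry.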

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv · · CV NLP

A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length and multimodal richness, using human-validated samples. LongShOTAgent, an agentic system, is also presented for analyzing long videos, with both the benchmark and agent demonstrating the challenges faced by state-of-the-art MLLMs.

From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region

arXiv · · CV Research

A new study compares vision-language models (VLMs) to YOLOv8 for wastewater treatment plant (WWTP) identification in satellite imagery across the MENA region. VLMs like Gemma-3 demonstrate superior zero-shot performance compared to a YOLOv8 model trained on a dataset of 83,566 satellite images from Egypt, Saudi Arabia, and the UAE. The research suggests VLMs offer a scalable, annotation-free alternative for remote sensing of WWTPs.

OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving

arXiv · · CV RL

The paper introduces OmniGen, a unified framework for generating aligned multimodal sensor data for autonomous driving using a shared Bird's Eye View (BEV) space. It uses a novel generalizable multimodal reconstruction method (UAE) to jointly decode LiDAR and multi-view camera data through volume rendering. The framework incorporates a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation, demonstrating good performance and multimodal consistency.

Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models

arXiv · · CV RL

Researchers at MBZUAI have introduced Video-R2, a reinforcement learning approach to improve the consistency and visual grounding of reasoning in multimodal language models. Video-R2 combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a Temporal Alignment Reward (TAR). The model demonstrates higher Think Answer Consistency (TAC), Video Attention Score (VAS), and accuracy across multiple benchmarks, showing improved temporal alignment and reasoning coherence for video understanding.
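The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than against a learned value function. A minimal sketch of that normalization step follows. The reward values are made up, and the use of the population standard deviation and the `eps` term are assumptions; how Video-R2 combines its Temporal Alignment Reward with other reward terms is not described here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scalar rewards for four sampled responses to one video question.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Responses scoring above the group mean get positive advantages and are reinforced; those below the mean are discouraged. If all rewards in a group are equal, every advantage is zero and the group contributes no gradient signal.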

Video-CoM: Interactive Video Reasoning via Chain of Manipulations

arXiv · · CV RL

Researchers at MBZUAI introduce "Interactive Video Reasoning," a new paradigm enabling models to actively "think with videos" by performing iterative visual actions to gather and refine evidence. They developed Video-CoM, which reasons through a Chain of Manipulations (CoM), and constructed Video-CoM-Instruct, an 18K instruction-tuning dataset for multi-step manipulation reasoning. The model is further optimized via reinforcement learning with reasoning-aware Group Relative Policy Optimization (GRPO), achieving strong results across nine video reasoning benchmarks.

FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models

arXiv · · NLP LLM

The paper introduces FanarGuard, a bilingual moderation filter for Arabic and English language models that considers both safety and cultural alignment. A dataset of 468K prompt-response pairs was created and scored by LLM judges on harmlessness and cultural awareness to train the filter. The first benchmark targeting Arabic cultural contexts was developed to evaluate cultural alignment. Why it matters: FanarGuard advances context-sensitive AI safeguards by integrating cultural awareness into content moderation, addressing a critical gap in current alignment techniques.

Datacenters in the Desert: Feasibility and Sustainability of LLM Inference in the Middle East

arXiv · · Research LLM

This paper analyzes the energy consumption and carbon footprint of LLM inference in the UAE compared to Iceland, Germany, and the USA. The study uses DeepSeek Coder 1.3B and the HumanEval dataset to evaluate code generation. It provides a comparative analysis of geographical trade-offs for climate-aware AI deployment, specifically addressing the challenges and potential of datacenters in desert regions.

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

arXiv · · LLM CV

Researchers at MBZUAI have introduced EvoLMM, a self-evolving framework for large multimodal models that enhances reasoning capabilities without human-annotated data or reward distillation. EvoLMM uses two cooperative agents, a Proposer and a Solver, which generate image-grounded questions and solve them through internal consistency, using a continuous self-rewarding process. Evaluations using Qwen2.5-VL as the base model showed performance gains of up to 3% on multimodal math-reasoning benchmarks like ChartQA, MathVista, and MathVision using only raw training images.

Sovereign AI: Rethinking Autonomy in the Age of Global Interdependence

arXiv · · Policy Research

This paper proposes a framework for understanding AI sovereignty as a balance between autonomy and interdependence, considering global data, supply chains, and standards. It introduces a planner's model with policy heuristics for equalizing marginal returns across sovereignty pillars and setting openness. The model is applied to India and the Middle East (Saudi Arabia and UAE), finding that managed interdependence, rather than isolation, is key for AI sovereignty.

MMRINet: Efficient Mamba-Based Segmentation with Dual-Path Refinement for Low-Resource MRI Analysis

arXiv · · Research Healthcare

Researchers from MBZUAI have developed MMRINet, a Mamba-based neural network for efficient brain tumor segmentation in MRI scans. The model uses Dual-Path Feature Refinement and Progressive Feature Aggregation to achieve high accuracy with only 2.5M parameters, making it suitable for low-resource clinical environments. MMRINet achieves a Dice score of 0.752 and HD95 of 12.23 on the BraTS-Lighthouse SSA 2025 benchmark.
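
For readers unfamiliar with the Dice score reported above, here is a minimal sketch of the standard metric, 2|A∩B| / (|A| + |B|), for two binary masks. The function name `dice_score`, the flat 0/1 list representation, and the empty-mask convention are assumptions made for illustration; this is not the paper's evaluation code.

```python
def dice_score(pred, target):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count voxels marked positive in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Sum of positive voxels in each mask.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 4-voxel masks: one overlapping voxel out of three marked in total.
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they do not overlap at all.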

The Future of AI in the GCC Post-NPM Landscape: A Comparative Analysis of Kuwait and the UAE

arXiv · · Policy Research

This study compares AI uptake in the UAE and Kuwait, analyzing how constitutional, collective-choice, and operational rules shape AI implementation and its impact on citizen centricity and public value creation. It finds that the UAE's concentrated authority and pro-innovation environment enable scaling AI initiatives, while Kuwait's dispersed governance and cautious approach limit progress despite similar resources. The research highlights the importance of vertical rule coherence over wealth in determining AI's public-value yield.

Cross-Document Topic-Aligned Chunking for Retrieval-Augmented Generation

arXiv · · NLP LLM

This paper introduces Cross-Document Topic-Aligned (CDTA) chunking to address knowledge fragmentation in Retrieval-Augmented Generation (RAG) systems. CDTA identifies topics across documents, maps segments to topics, and synthesizes them into unified chunks. Experiments on HotpotQA and UAE legal texts show that CDTA improves faithfulness and citation accuracy compared to existing chunking methods, especially for complex queries requiring multi-hop reasoning.

Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

arXiv · · NLP LLM

This paper proposes a method to reduce the verbosity of LLMs in step-by-step reasoning by retaining moderately easy problems during Reinforcement Learning with Verifiable Rewards (RLVR) training. These easy samples act as an implicit length regularizer, preventing the model from excessively increasing output length on harder problems. Experiments using Qwen3-4B-Thinking-2507 show the model matches baseline accuracy with solutions nearly half as long.

LLM-based Multi-class Attack Analysis and Mitigation Framework in IoT/IIoT Networks

arXiv · · Research NLP

This paper introduces a framework that combines machine learning for multi-class attack detection in IoT/IIoT networks with large language models (LLMs) for attack behavior analysis and mitigation suggestion. The framework uses role-play prompt engineering with RAG to guide LLMs like ChatGPT-o3 and DeepSeek-R1, and introduces new evaluation metrics for quantitative assessment. Experiments using Edge-IIoTset and CICIoT2023 datasets showed Random Forest as the best detection model and ChatGPT-o3 outperforming DeepSeek-R1 in attack analysis and mitigation.

Climate Adaptation-Aware Flood Prediction for Coastal Cities Using Deep Learning

arXiv · · Research CV

Researchers have developed a CNN-based deep learning model for predicting coastal flooding in cities under various sea-level rise scenarios. The model uses a vision-based, low-resource deep learning framework and is trained on datasets from Abu Dhabi and San Francisco. Results show a 20% reduction in mean absolute error compared to existing methods, demonstrating potential for scalable coastal flood management.

Mubeen AI: A Specialized Arabic Language Model for Heritage Preservation and User Intent Understanding

arXiv · · NLP LLM

MASARAT SA has developed Mubeen, a proprietary Arabic language model specializing in Arabic linguistics, Islamic studies, and cultural heritage. Mubeen was trained using native Arabic sources, including digitized historical manuscripts processed via a proprietary Arabic OCR engine. The model employs a Practical Closure Architecture to improve user intent understanding and provide decisive guidance. Why it matters: Mubeen addresses the utility gap in current Arabic LLMs by focusing on native Arabic data and cultural authenticity, which is critical for heritage preservation and alignment with Saudi Vision 2030.