A new approach to composed video retrieval (CoVR) is presented, which leverages large multimodal models to infer causal and temporal consequences implied by an edit. The method aligns reasoned queries to candidate videos without task-specific finetuning. A new benchmark, CoVR-Reason, is introduced to evaluate reasoning in CoVR.
The paper introduces SectEval, a new benchmark to evaluate sectarian biases in LLMs concerning Sunni and Shia Islam, available in English and Hindi. Results show significant inconsistencies in LLM responses based on language, with some models favoring Shia responses in English but Sunni in Hindi. Location-based experiments further reveal that advanced models adapt their responses based on the user's claimed country, while smaller models exhibit a consistent Sunni-leaning bias.
This study introduces a reinforcement learning (RL) framework using Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) to optimize the cleaning schedules of photovoltaic panels in arid regions. Applied to a case study in Abu Dhabi, the PPO-based framework demonstrated up to 13% cost savings compared to simulation optimization methods by dynamically adjusting cleaning intervals based on environmental conditions. The research highlights the potential of RL in enhancing the efficiency and reducing the operational costs of solar power generation.
MBZUAI researchers have developed an automatic interview system that uses LLMs to elicit nuanced, role-specific information from job candidates, improving early-stage hiring decisions. The system updates its belief about an applicant's rubric-oriented latent traits in a calibrated way based on their interview performance. Evaluation on simulated interviews showed the system's belief converges towards the simulated applicants' constructed ability levels.
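One generic way to picture a belief over a latent trait being updated from interview evidence is a discrete Bayesian update. The sketch below is only an illustration under stated assumptions, not the system's actual method: the three ability levels, their pass probabilities, and the function name `update_belief` are all hypothetical.

```python
def update_belief(prior, pass_probs, passed):
    """Return the posterior over ability levels after one rubric item.

    prior      -- probabilities over ability levels (sums to 1)
    pass_probs -- assumed chance that each level satisfies the item
    passed     -- whether the applicant satisfied the item
    """
    # Likelihood of the observation under each ability level.
    likelihoods = [p if passed else 1.0 - p for p in pass_probs]
    unnormalized = [pr * lk for pr, lk in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [x / total for x in unnormalized]

# Hypothetical pass probabilities for low, medium, and high ability.
LEVEL_PASS_PROBS = [0.2, 0.5, 0.8]

belief = [1 / 3, 1 / 3, 1 / 3]  # uniform prior
for outcome in (True, True, False):
    belief = update_belief(belief, LEVEL_PASS_PROBS, outcome)
```

Repeated updates concentrate the belief on the level whose assumed pass rate best matches the observed answers, which is the kind of convergence the simulated-interview evaluation looks for.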
The paper introduces ILION, a deterministic execution gate designed to ensure the safety of autonomous AI agents by classifying proposed actions as either BLOCK or ALLOW. ILION uses a five-component cascade architecture that operates without statistical training, API dependencies, or labeled data. Evaluation against existing text-safety infrastructures demonstrates ILION's superior performance in preventing unauthorized actions, achieving an F1 score of 0.8515 with sub-millisecond latency.
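The idea of a deterministic cascade that returns BLOCK or ALLOW can be sketched with ordinary rule functions. This is a minimal illustration, not ILION's architecture: the two checks, the deny list, the `/workspace/` prefix, and the action dictionary layout are hypothetical, and the summary does not say what the five real components are.

```python
from typing import Callable, Optional

# A check returns a reason string to block on, or None to defer to the next check.
Check = Callable[[dict], Optional[str]]

def deny_listed_tool(action: dict) -> Optional[str]:
    # Hypothetical rule: block tools that appear on a fixed deny list.
    if action.get("tool") in {"delete_database", "transfer_funds"}:
        return "tool is on the deny list"
    return None

def outside_workspace(action: dict) -> Optional[str]:
    # Hypothetical rule: block file paths outside the agent's workspace.
    # A real gate would also normalize the path (e.g. resolve "..").
    path = action.get("path")
    if path is not None and not path.startswith("/workspace/"):
        return "path is outside the workspace"
    return None

def gate(action: dict, checks: list[Check]) -> tuple[str, str]:
    """Run the checks in order; the first one that objects blocks the action."""
    for check in checks:
        reason = check(action)
        if reason is not None:
            return "BLOCK", reason
    return "ALLOW", "no check objected"

CHECKS = [deny_listed_tool, outside_workspace]
decision, reason = gate({"tool": "read_file", "path": "/etc/passwd"}, CHECKS)
```

Because every check is a pure function of the proposed action, the same input always yields the same decision, with no model call or network dependency on the path.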
MBZUAI researchers introduce DuwatBench, a new benchmark for multimodal understanding of Arabic calligraphy. The dataset contains 1,272 samples across six calligraphic styles with detailed annotations to evaluate visual-text alignment. Evaluation of 13 multimodal models reveals challenges in processing calligraphic variations and artistic distortions, highlighting the need for culturally grounded AI research.
A national survey of 330 participants in Saudi Arabia reveals that 93% actively use Generative AI, primarily for text-based tasks, while awareness and understanding remain uneven. Participants recognize benefits such as productivity but caution against risks including privacy loss, misinformation, and ethical misuse. The study highlights the need for AI literacy, culturally aligned solutions, and stronger frameworks for responsible deployment in Saudi Arabia.
This paper introduces a hybrid deep learning and machine learning pipeline for classifying construction and demolition waste. A dataset of 1,800 images from UAE construction sites was created, and deep features were extracted using a pre-trained Xception network. The combination of Xception features with machine learning classifiers achieved up to 99.5% accuracy, demonstrating state-of-the-art performance for debris identification.
The paper introduces Yet another Policy Optimization (YaPO), a reference-free method for learning sparse steering vectors in the latent space of a Sparse Autoencoder (SAE) to steer LLMs. By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Experiments show YaPO converges faster, achieves stronger performance, exhibits improved training stability and preserves general knowledge compared to dense steering baselines.
This paper introduces an explainable machine learning framework for early-stage chronic kidney disease (CKD) screening, specifically designed for low-resource settings in Bangladesh and South Asia. The framework utilizes a community-based dataset from Bangladesh and evaluates multiple ML classifiers with feature selection techniques. Results show that the ML models achieve high accuracy and sensitivity, outperforming existing screening tools and demonstrating strong generalizability across independent datasets from India, the UAE, and Bangladesh.
The paper introduces the Prism Hypothesis, which posits a correspondence between an encoder's feature spectrum and its functional role, with semantic encoders capturing low-frequency components and pixel encoders retaining high-frequency information. Based on this, the authors propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details using a frequency-band modulator. Experiments on ImageNet and MS-COCO demonstrate that UAE effectively unifies semantic abstraction and pixel-level fidelity, achieving state-of-the-art performance.
A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length and multimodal richness, using human-validated samples. LongShOTAgent, an agentic system for analyzing long videos, is also presented; together, the benchmark and agent expose the challenges that state-of-the-art MLLMs face.
A new study compares vision-language models (VLMs) to YOLOv8 for wastewater treatment plant (WWTP) identification in satellite imagery across the MENA region. VLMs like Gemma-3 demonstrate superior zero-shot performance compared to a YOLOv8 model trained on a dataset of 83,566 satellite images from Egypt, Saudi Arabia, and the UAE. The research suggests VLMs offer a scalable, annotation-free alternative for remote sensing of WWTPs.
The paper introduces OmniGen, a unified framework for generating aligned multimodal sensor data for autonomous driving using a shared Bird's Eye View (BEV) space. It uses a novel generalizable multimodal reconstruction method (UAE) to jointly decode LiDAR and multi-view camera data through volume rendering. The framework incorporates a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation, demonstrating good performance and multimodal consistency.
Researchers at MBZUAI have introduced Video-R2, a reinforcement learning approach to improve the consistency and visual grounding of reasoning in multimodal language models. Video-R2 combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a Temporal Alignment Reward (TAR). The model demonstrates higher Think Answer Consistency (TAC), Video Attention Score (VAS), and accuracy across multiple benchmarks, showing improved temporal alignment and reasoning coherence for video understanding.
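The group-relative part of GRPO can be sketched in a few lines: each sampled response is scored against the other responses drawn for the same prompt. The snippet below is a minimal illustration of that normalization only, assuming population standard deviation and a small `eps` for numerical safety; it does not implement the Temporal Alignment Reward or the policy update, and the function name is hypothetical.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    A response's advantage is its reward minus the group mean, divided by
    the group standard deviation (population std is an assumption here).
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, with illustrative scalar rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group's average receive positive advantages and the rest negative ones, so no separate value network is needed to form a baseline.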
Researchers at MBZUAI introduce "Interactive Video Reasoning," a new paradigm enabling models to actively "think with videos" by performing iterative visual actions to gather and refine evidence. They developed Video CoM, which reasons through a Chain of Manipulations (CoM), and constructed Video CoM Instruct, an 18K instruction tuning dataset for multi-step manipulation reasoning. The model is further optimized via reinforcement learning with reasoning aware Group Relative Policy Optimization (GRPO), achieving strong results across nine video reasoning benchmarks.
This paper analyzes the energy consumption and carbon footprint of LLM inference in the UAE compared to Iceland, Germany, and the USA. The study uses DeepSeek Coder 1.3B and the HumanEval dataset to evaluate code generation. It provides a comparative analysis of geographical trade-offs for climate-aware AI deployment, specifically addressing the challenges and potential of datacenters in desert regions.
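The basic arithmetic behind such a comparison is energy consumed multiplied by the carbon intensity of the local grid. The sketch below assumes that simple model plus an optional power usage effectiveness (PUE) factor for cooling overhead; whether the paper uses PUE is not stated, and all numbers and region names are illustrative placeholders rather than measured values.

```python
def inference_emissions_g(energy_kwh, intensity_g_per_kwh, pue=1.0):
    """Estimate emissions in grams of CO2-equivalent for an inference workload.

    energy_kwh          -- IT energy drawn by the hardware
    intensity_g_per_kwh -- grid carbon intensity at the datacenter location
    pue                 -- power usage effectiveness (1.0 means no overhead)
    """
    return energy_kwh * pue * intensity_g_per_kwh

# Placeholder grids: same workload, different carbon intensity and cooling overhead.
regions = {
    "region_a": {"intensity_g_per_kwh": 400.0, "pue": 1.5},
    "region_b": {"intensity_g_per_kwh": 50.0, "pue": 1.1},
}
workload_kwh = 2.0
estimates = {
    name: inference_emissions_g(workload_kwh, **params)
    for name, params in regions.items()
}
```

Holding the workload fixed isolates the geographical terms, which is the trade-off a climate-aware deployment decision has to weigh.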
Researchers at MBZUAI have introduced EvoLMM, a self-evolving framework for large multimodal models that enhances reasoning capabilities without human-annotated data or reward distillation. EvoLMM uses two cooperative agents, a Proposer and a Solver, which generate image-grounded questions and solve them through internal consistency, using a continuous self-rewarding process. Evaluations using Qwen2.5-VL as the base model showed performance gains of up to 3% on multimodal math-reasoning benchmarks like ChartQA, MathVista, and MathVision using only raw training images.
This paper proposes a framework for understanding AI sovereignty as a balance between autonomy and interdependence, considering global data, supply chains, and standards. It introduces a planner's model with policy heuristics for equalizing marginal returns across sovereignty pillars and setting openness. The model is applied to India and the Middle East (Saudi Arabia and UAE), finding that managed interdependence, rather than isolation, is key for AI sovereignty.
Researchers from MBZUAI have developed MMRINet, a Mamba-based neural network for efficient brain tumor segmentation in MRI scans. The model uses Dual-Path Feature Refinement and Progressive Feature Aggregation to achieve high accuracy with only 2.5M parameters, making it suitable for low-resource clinical environments. MMRINet achieves a Dice score of 0.752 and HD95 of 12.23 on the BraTS-Lighthouse SSA 2025 benchmark.
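For readers unfamiliar with the metric, the Dice score measures overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. The sketch below computes it on flattened binary masks; returning 1.0 when both masks are empty is a common convention assumed here, not something taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks are empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * intersection / total

# Toy masks: the prediction covers the one true voxel plus one false positive.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

A score of 1.0 means the masks coincide and 0.0 means they do not overlap at all.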
This study compares AI uptake in the UAE and Kuwait, analyzing how constitutional, collective-choice, and operational rules shape AI implementation and its impact on citizen centricity and public value creation. It finds that the UAE's concentrated authority and pro-innovation environment enable scaling AI initiatives, while Kuwait's dispersed governance and cautious approach limit progress despite similar resources. The research highlights the importance of vertical rule coherence over wealth in determining AI's public-value yield.
This paper introduces Cross-Document Topic-Aligned (CDTA) chunking to address knowledge fragmentation in Retrieval-Augmented Generation (RAG) systems. CDTA identifies topics across documents, maps segments to topics, and synthesizes them into unified chunks. Experiments on HotpotQA and UAE legal texts show that CDTA improves faithfulness and citation accuracy compared to existing chunking methods, especially for complex queries requiring multi-hop reasoning.
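The three steps named above (identify topics, map segments to topics, merge per topic) can be sketched with a toy pipeline. Everything here is a stand-in under stated assumptions: topics come from a hand-written keyword table rather than being discovered, and "synthesis" is plain concatenation, whereas the paper's own topic identification and synthesis steps are not described in this summary. All names and sample sentences are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical topic table; a real system would discover topics from the corpus.
TOPIC_KEYWORDS = {
    "termination": {"terminate", "termination", "notice"},
    "leave": {"leave", "vacation"},
}

def assign_topic(text, topic_keywords):
    """Pick the topic whose keywords overlap the segment most, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best, best_overlap = None, 0
    for topic, keywords in topic_keywords.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best, best_overlap = topic, overlap
    return best

def topic_aligned_chunks(segments, topic_keywords):
    """Group (doc_id, text) segments by topic across all documents."""
    grouped = defaultdict(list)
    for doc_id, text in segments:
        topic = assign_topic(text, topic_keywords)
        if topic is not None:
            grouped[topic].append(f"[{doc_id}] {text}")
    # Concatenation stands in for the synthesis step.
    return {topic: "\n".join(parts) for topic, parts in grouped.items()}

segments = [
    ("doc_a", "Either party may terminate with notice."),
    ("doc_b", "Employees accrue leave monthly."),
    ("doc_c", "Termination requires written notice."),
]
chunks = topic_aligned_chunks(segments, TOPIC_KEYWORDS)
```

The point of the grouping is that a query about one topic retrieves a single chunk containing evidence from several documents, with the source tags kept for citation.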
A new method is proposed to reduce the verbosity of LLMs in step-by-step reasoning by retaining moderately easy problems during Reinforcement Learning with Verifiable Rewards (RLVR) training. This approach acts as an implicit length regularizer, preventing the model from excessively increasing output length on harder problems. Experiments using Qwen3-4B-Thinking-2507 show the model achieves baseline accuracy with solutions nearly half as long.
This paper introduces a framework that combines machine learning for multi-class attack detection in IoT/IIoT networks with large language models (LLMs) for attack behavior analysis and mitigation suggestion. The framework uses role-play prompt engineering with RAG to guide LLMs like ChatGPT-o3 and DeepSeek-R1, and introduces new evaluation metrics for quantitative assessment. Experiments using Edge-IIoTset and CICIoT2023 datasets showed Random Forest as the best detection model and ChatGPT-o3 outperforming DeepSeek-R1 in attack analysis and mitigation.
Researchers have developed a CNN-based deep learning model for predicting coastal flooding in cities under various sea-level rise scenarios. The model utilizes a vision-based, low-resource DL framework and is trained on datasets from Abu Dhabi and San Francisco. Results show a 20% reduction in mean absolute error compared to existing methods, demonstrating potential for scalable coastal flood management.
Researchers introduce MATRIX, a vision-centric agent tuning framework for robust tool-use reasoning in VLMs. The framework includes M-TRACE, a dataset of 28.5K multimodal tasks with 177K verified trajectories, and Pref-X, a set of 11K automatically generated preference pairs. Experiments show MATRIX consistently outperforms open- and closed-source VLMs across three benchmarks.
The paper introduces TimeHUT, a new method for learning time-series representations using hierarchical uniformity-tolerance balancing of contrastive representations. TimeHUT employs a hierarchical setup to learn both instance-wise and temporal information, along with a temperature scheduler to balance uniformity and tolerance. The method was evaluated on UCR, UAE, Yahoo, and KPI datasets, demonstrating superior performance in classification tasks and competitive results in anomaly detection.
Researchers introduce ALARB, a new benchmark for evaluating reasoning in Arabic LLMs using 13K Saudi commercial court cases. The benchmark includes tasks like verdict prediction, reasoning chain completion, and identification of relevant regulations. Instruction-tuning a 12B parameter model on ALARB achieves performance comparable to GPT-4o in verdict prediction and generation.
The study analyzes over 1,000 images generated by ImageFX, DALL-E V3, and Grok for 56 Saudi professions, finding significant gender imbalances and cultural inaccuracies. DALL-E V3 exhibited the strongest gender stereotyping, with 96% male depictions, particularly in leadership and technical roles. The research underscores the need for diverse training data and culturally sensitive evaluation to ensure equitable AI outputs that accurately reflect Saudi Arabia's labor market and culture.
Researchers from MBZUAI have introduced SPECS, a new reference-free evaluation metric for long image captions that modifies CLIP to emphasize specificity. SPECS aims to improve the correlation with human judgment while maintaining computational efficiency compared to LLM-based metrics. The proposed approach is intended for iterative use during image captioning model development, offering a practical alternative to existing methods.
The researchers introduce KAU-CSSL, the first continuous Saudi Sign Language (SSL) dataset focusing on complete sentences. They propose a transformer-based model using ResNet-18 for spatial feature extraction and a Transformer Encoder with Bidirectional LSTM for temporal dependencies. The model achieved 99.02% accuracy in signer-dependent mode and 77.71% in signer-independent mode, advancing communication tools for the SSL community.
This paper presents a UI-level evaluation of ALLaM-34B, an Arabic-centric LLM developed by SDAIA and deployed in the HUMAIN Chat service. The evaluation used a prompt pack spanning various Arabic dialects, code-switching, reasoning, and safety, with outputs scored by frontier LLM judges. Results indicate strong performance in generation, code-switching, MSA handling, reasoning, and improved dialect fidelity, positioning ALLaM-34B as a robust Arabic LLM suitable for real-world use.
This paper introduces Saudi-Dialect-ALLaM, a LoRA fine-tuned version of the Saudi Arabian foundation model ALLaM-7B-Instruct-preview, designed to improve the generation of Saudi dialects (Najdi and Hijazi). The model is trained on a private dataset of 5,466 synthetic instruction-response pairs, with two variants explored: Dialect-Token and No-Token training. Results indicate that the Dialect-Token model achieves superior dialect control and fidelity compared to generic instruction models, although the dataset and model weights are not released.
Researchers introduce UnsafeChain, a new safety alignment dataset designed to improve the safety of large reasoning models (LRMs) by focusing on 'hard prompts' that elicit harmful outputs. The dataset identifies and corrects unsafe completions into safe responses, exposing models to unsafe behaviors and guiding their correction. Fine-tuning LRMs on UnsafeChain demonstrates enhanced safety and preservation of general reasoning ability compared to existing datasets like SafeChain and STAR-1.
This paper introduces Absher, a new benchmark for evaluating LLMs' linguistic and cultural competence in Saudi dialects. The benchmark comprises over 18,000 multiple-choice questions spanning six categories, using dialectal words, phrases, and proverbs from various regions of Saudi Arabia. Evaluation of state-of-the-art LLMs reveals performance gaps, especially in cultural inference and contextual understanding, highlighting the need for dialect-aware training.
Researchers developed a two-stage AI pipeline to predict losses in desalination efficiency caused by climate factors in the UAE, achieving 98% accuracy. The model forecasts aerosol optical depth (AOD) and uses it, together with meteorological data, to predict desalination efficiency. A dust-aware control logic was developed to optimize plant operations, and an interactive dashboard was created for decision support.
MBZUAI researchers have introduced MIRA, a novel framework for improving the factual accuracy of multimodal large language models in medical applications. MIRA uses calibrated retrieval to manage factual risk and integrates image embeddings with a medical knowledge base for efficient reasoning. Evaluated on medical VQA and report generation benchmarks, MIRA achieves state-of-the-art results, with code available on GitHub.
The paper introduces ScoreAdv, a novel approach for generating natural unrestricted adversarial examples (UAEs) using diffusion models. It incorporates an adversarial guidance mechanism and saliency maps to shift the sampling distribution and inject visual information. Experiments on ImageNet and CelebA datasets demonstrate state-of-the-art attack success rates, image quality, and robustness against defenses.
Researchers from MBZUAI introduce Forget-MI, a machine unlearning method tailored for multimodal medical data, enhancing privacy by removing specific patient data from AI models. Forget-MI utilizes loss functions and perturbation techniques to unlearn both unimodal and joint data representations. The method outperforms existing techniques at reducing the success of membership inference attacks and at removing the targeted data, while preserving overall model performance.
This paper introduces MOTOR, a multimodal retrieval and re-ranking approach for medical visual question answering (MedVQA). It uses grounded captions and optimal transport to capture relationships between queries and retrieved context, leveraging both textual and visual information. MOTOR identifies clinically relevant contexts to augment VLM input, achieving higher accuracy on MedVQA datasets. Empirical analysis shows MOTOR outperforms state-of-the-art methods by an average of 6.45%.
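Optimal transport finds a soft matching between two sets of items that minimizes total cost subject to fixed marginals. The sketch below uses entropic regularization with Sinkhorn iterations, a standard solver chosen here for illustration; the summary does not say which solver or cost MOTOR uses, and the cost matrix and weights are toy values.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic optimal transport plan between weight vectors a and b.

    cost[i][j] is the cost of moving mass from source i to target j.
    Very large cost/eps ratios can underflow to zero; this sketch does
    not guard against that.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy example: two query tokens, two context tokens, cheap diagonal matches.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, [0.5, 0.5], [0.5, 0.5])
```

In a re-ranking setting, the total cost of the resulting plan could serve as a distance between a query and a retrieved context, with lower values indicating a closer match.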
A new methodology that emulates fact-checker criteria uses LLMs to assess the factuality and bias of news outlets. The approach uses prompts based on fact-checking criteria to elicit LLM responses and aggregates them into predictions. Experiments demonstrate improvements over baselines, and an error analysis examines the effect of media popularity and region. The dataset and code are released at https://github.com/mbzuai-nlp/llm-media-profiling.
A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark includes 8k manually verified samples across 15 categories and varying video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples, with both to be publicly released.
MBZUAI researchers introduce TerraFM, a scalable self-supervised learning model for Earth observation that uses Sentinel-1 and Sentinel-2 imagery. The model unifies radar and optical inputs through modality-specific patch embeddings and adaptive cross-attention fusion. TerraFM achieves strong generalization on classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.
MBZUAI researchers introduce VideoMathQA, a new benchmark for evaluating mathematical reasoning in videos, requiring models to interpret visual information, text, and spoken cues. The dataset spans 10 mathematical domains with videos ranging from 10 seconds to over 1 hour, and includes multi-step reasoning annotations. The benchmark aims to evaluate temporal cross-modal reasoning and highlights the limitations of existing approaches in complex video-based mathematical problem solving.
Researchers from MBZUAI have introduced VideoMolmo, a large multimodal model for spatio-temporal pointing conditioned on textual descriptions. The model incorporates a temporal module with an attention mechanism and a temporal mask fusion pipeline using SAM2 for improved coherence across video sequences. They also curated a dataset of 72k video-caption pairs and introduced VPoS-Bench, a benchmark for evaluating generalization across real-world scenarios, with code and models publicly available.
MBZUAI introduces Agent-X, a benchmark for evaluating multi-step reasoning in vision-centric agents across real-world, multimodal settings. Agent-X includes 828 tasks with diverse visual contexts and spans six environments, requiring tool use and stepwise decision-making. Experiments show that current LMMs struggle with multi-step vision tasks, achieving less than 50% success, highlighting areas for improvement in LMM reasoning and tool use.
MBZUAI researchers introduce SocialMaze, a new benchmark for evaluating social reasoning capabilities in large language models (LLMs). SocialMaze includes six diverse tasks across social reasoning games, daily-life interactions, and digital community platforms, emphasizing deep reasoning, dynamic interaction, and information uncertainty. Experiments show that LLMs vary in how well they handle dynamic interactions and degrade under uncertainty, but can be improved via fine-tuning on curated reasoning examples.
MBZUAI researchers release 'Fann or Flop', a new benchmark for evaluating Arabic poetry understanding in LLMs. The benchmark covers 12 historical eras and 14 poetic genres, assessing semantic understanding, metaphor interpretation, and cultural context. Evaluation of state-of-the-art LLMs reveals challenges in poetic understanding despite strong performance on standard Arabic benchmarks.
MBZUAI researchers introduce ARB, the first comprehensive benchmark for evaluating step-by-step multimodal reasoning in Arabic across textual and visual modalities. The benchmark spans 11 diverse domains and includes 1,356 multimodal samples with 5,119 human-curated reasoning steps. Evaluations of 12 state-of-the-art LMMs revealed challenges in coherence, faithfulness, and cultural grounding, highlighting the need for culturally aware AI systems.
Researchers from MBZUAI have introduced UrduFactCheck, a new framework for fact-checking in Urdu, along with two datasets: UrduFactBench and UrduFactQA. The framework uses monolingual and translation-based evidence retrieval to address the lack of Urdu resources. Evaluations using twelve LLMs showed that translation-augmented methods improve performance, highlighting challenges for open-source LLMs in Urdu.