GCC AI Research

Archive Monthly

June 2025

7 articles

Top Stories

Forget-MI: Machine Unlearning for Forgetting Multimodal Information in Healthcare Settings

arXiv · Healthcare Research

Researchers from MBZUAI introduce Forget-MI, a machine unlearning method tailored to multimodal medical data that strengthens privacy by removing specific patients' data from trained AI models. Forget-MI uses loss functions and perturbation techniques to unlearn both unimodal and joint data representations. The method reduces the success of membership inference attacks and removes the targeted data more effectively than existing techniques, while preserving overall model performance.
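To make the membership inference claim concrete, here is a minimal sketch of a generic loss-threshold attack, one common way to probe whether a model still "remembers" a sample. It is not Forget-MI's evaluation protocol: the function name `mia_accuracy`, the threshold, and the toy loss values are all assumptions chosen for illustration.

```python
def mia_accuracy(forget_losses, unseen_losses, threshold):
    """Accuracy of a simple loss-threshold membership inference attack.

    The attacker guesses "member" when a sample's loss is below `threshold`.
    `forget_losses` are losses on samples that were in the training set
    (and were later targeted for unlearning); `unseen_losses` are losses on
    samples the model never saw. All values here are hypothetical.
    """
    # Members the attacker correctly flags as members.
    hits = sum(loss < threshold for loss in forget_losses)
    # Non-members the attacker correctly rejects.
    hits += sum(loss >= threshold for loss in unseen_losses)
    return hits / (len(forget_losses) + len(unseen_losses))


# Toy numbers: before unlearning, forgotten samples have suspiciously low loss.
before = mia_accuracy([0.1, 0.2], [0.9, 1.1], threshold=0.5)
# After unlearning, their losses look like those of unseen samples.
after = mia_accuracy([0.8, 1.0], [0.9, 1.1], threshold=0.5)
```

In this toy setup the attack is perfect before unlearning and no better than chance afterwards, which is the direction of change an unlearning method aims for.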

MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering

arXiv · Research NLP

This paper introduces MOTOR, a multimodal retrieval and re-ranking approach for medical visual question answering (MedVQA). MOTOR uses grounded captions and optimal transport to capture relationships between queries and retrieved context, drawing on both textual and visual information. It identifies clinically relevant contexts to augment the input of a vision-language model (VLM), achieving higher accuracy on MedVQA datasets. Empirical analysis shows MOTOR outperforms state-of-the-art methods by an average of 6.45%.
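As a rough illustration of re-ranking with optimal transport, the sketch below scores each retrieved candidate by the entropic optimal transport cost (computed with Sinkhorn iterations) between a set of query vectors and a set of candidate vectors, then sorts candidates by ascending cost. The summary does not give MOTOR's actual formulation, so the cosine cost, the uniform masses, the `eps` and `iters` values, and the function names are all assumptions.

```python
import math


def cosine_cost(u, v):
    """Cosine distance between two non-zero vectors (0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def sinkhorn_cost(xs, ys, eps=0.1, iters=200):
    """Entropic OT cost between two vector sets with uniform masses."""
    n, m = len(xs), len(ys)
    cost = [[cosine_cost(x, y) for y in ys] for x in xs]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform mass on query vectors
    b = [1.0 / m] * m  # uniform mass on candidate vectors
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan is P[i][j] = u[i] * kernel[i][j] * v[j].
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j]
        for i in range(n)
        for j in range(m)
    )


def rerank(query_vecs, candidates):
    """Sort (name, vectors) candidates by ascending OT cost to the query."""
    scored = [(sinkhorn_cost(query_vecs, vecs), name) for name, vecs in candidates]
    return [name for _, name in sorted(scored)]


# Toy 2-D embeddings; real systems would use encoder outputs.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("opposed", [[-1.0, 0.0], [0.0, -1.0]]),
    ("aligned", [[1.0, 0.0], [0.0, 1.0]]),
]
ranking = rerank(query, candidates)
```

Treating the query and each context as sets of vectors lets the transport plan match parts of one to parts of the other, rather than comparing a single pooled embedding per item.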

Profiling News Media for Factuality and Bias Using LLMs and the Fact-Checking Methodology of Human Experts

arXiv · NLP LLM

A new methodology assesses the factuality and bias of news outlets using LLMs, emulating the criteria that human fact-checkers apply. The approach uses prompts based on those criteria to elicit LLM responses, then aggregates the responses into predictions. Experiments show improvements over baselines, and an error analysis examines the effect of media popularity and region. The dataset and code are released at https://github.com/mbzuai-nlp/llm-media-profiling.
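The elicit-then-aggregate step can be pictured with a small sketch. The summary does not say how the responses are combined, so the majority vote below is only one plausible aggregation, and the function name `aggregate_labels` and the example labels are hypothetical.

```python
from collections import Counter


def aggregate_labels(responses):
    """Majority vote over per-prompt LLM labels (an assumed aggregation rule).

    Labels are normalized by stripping whitespace and lowercasing; empty
    responses are ignored. On a tie, the label seen first wins, because
    Counter.most_common keeps first-encountered order for equal counts.
    """
    normalized = [r.strip().lower() for r in responses if r.strip()]
    if not normalized:
        raise ValueError("no responses to aggregate")
    return Counter(normalized).most_common(1)[0][0]


# Hypothetical factuality labels returned for one outlet, one per criterion prompt.
per_prompt_labels = ["High", "high ", "mixed", "", "HIGH"]
prediction = aggregate_labels(per_prompt_labels)
```

A vote over several criterion-specific prompts is less sensitive to any single odd response than relying on one prompt alone.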

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

arXiv · NLP LLM

A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark includes 8k manually verified samples across 15 categories and varying video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples, with both to be publicly released.

TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation

arXiv · Research CV

MBZUAI researchers introduce TerraFM, a scalable self-supervised learning model for Earth observation that uses Sentinel-1 and Sentinel-2 imagery. The model unifies radar and optical inputs through modality-specific patch embeddings and adaptive cross-attention fusion. TerraFM achieves strong generalization on classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.

VideoMolmo: Spatio-Temporal Grounding Meets Pointing

arXiv · CV LLM

Researchers from MBZUAI have introduced VideoMolmo, a large multimodal model for spatio-temporal pointing conditioned on textual descriptions. The model incorporates a temporal module with an attention mechanism and a temporal mask fusion pipeline using SAM2 for improved coherence across video sequences. They also curated a dataset of 72k video-caption pairs and introduced VPoS-Bench, a benchmark for evaluating generalization across real-world scenarios, with code and models publicly available.

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

arXiv · Research CV

MBZUAI researchers introduce VideoMathQA, a new benchmark for evaluating mathematical reasoning in videos, which requires models to interpret visual information, text, and spoken cues. The dataset spans 10 mathematical domains, with videos ranging from 10 seconds to over 1 hour, and includes multi-step reasoning annotations. The benchmark is designed to evaluate temporal cross-modal reasoning and highlights the limitations of existing approaches in complex video-based mathematical problem solving.