GCC AI Research


CV

1001–1050 articles · Page 21

Meta will use AI to detect kids by analyzing height and bone structure - Gulf News

Gulf News News · · Ethics Policy

Meta is reportedly developing an AI system to detect the age of its users, particularly minors, by analyzing physical attributes such as height and bone structure. This technology aims to enhance age verification processes across Meta's platforms. The initiative seeks to bolster online safety measures for younger users and ensure compliance with age restrictions. Why it matters: This development signifies a major tech company's advanced use of AI for age verification, raising critical discussions about data privacy, the accuracy and ethical implications of biometric AI, and its global impact on child safety online, including within the Middle East.

Fake viral AI clip of India FM Sitharaman on high returns exposed - Gulf News

Gulf News News · · Ethics Policy

A fake, AI-generated video clip of India's Finance Minister, Nirmala Sitharaman, promoting high financial returns has been identified and exposed. The misleading clip, which went viral, presented false information related to investment opportunities. This incident was reported by Gulf News, highlighting a regional awareness of such digital misinformation. Why it matters: This incident highlights the growing challenge of AI-generated deepfakes used for financial misinformation and fraud, emphasizing the need for robust detection and public awareness in the digital age.

Dual Pose-Graph Semantic Localization for Vision-Based Autonomous Drone Racing

arXiv · · Robotics CV

This work presents a dual pose-graph architecture for robust real-time localization in autonomous drone racing. The system fuses monocular visual-inertial odometry with semantic gate detections, using a temporary graph to optimize multiple observations into refined constraints before promoting them to a persistent main graph. Evaluated on the TII-RATM dataset and deployed in the A2RL competition, it achieved a 56-74% reduction in Absolute Trajectory Error (ATE) compared to standalone VIO and reduced odometry drift by up to 4.2 meters per lap. Why it matters: This research significantly improves the reliability and accuracy of vision-based localization for high-speed autonomous drones, crucial for advanced robotics applications and competitive racing.
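The summary describes a buffer-then-promote pattern: repeated gate detections are accumulated in a temporary structure, and only their refined consensus is inserted into the persistent graph. A rough illustration follows (class and method names are ours, not the paper's; a real implementation would run a nonlinear least-squares optimization over the buffered observations rather than a mean, but the promotion logic has the same shape):

```python
# Minimal sketch (hypothetical API) of temporary-graph promotion: raw semantic
# gate detections are buffered, jointly refined, and only the refined constraint
# enters the persistent main graph.
import numpy as np

class DualPoseGraph:
    def __init__(self, min_obs=5):
        self.min_obs = min_obs
        self.temp = {}               # gate_id -> list of raw position observations
        self.main_constraints = []   # refined (gate_id, position, covariance)

    def add_gate_observation(self, gate_id, position_xyz):
        self.temp.setdefault(gate_id, []).append(np.asarray(position_xyz, float))
        if len(self.temp[gate_id]) >= self.min_obs:
            self._promote(gate_id)

    def _promote(self, gate_id):
        obs = np.stack(self.temp.pop(gate_id))
        refined = obs.mean(axis=0)       # stand-in for a local optimization
        cov = np.cov(obs.T) / len(obs)   # confidence of the refined constraint
        self.main_constraints.append((gate_id, refined, cov))

graph = DualPoseGraph()
for z in np.random.normal([5.0, 2.0, 1.5], 0.1, size=(6, 3)):
    graph.add_gate_observation("gate_3", z)
print(graph.main_constraints[0][1])  # refined gate position
```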

CoVR-R: Reason-Aware Composed Video Retrieval

arXiv · · CV RL

A new approach to composed video retrieval (CoVR) is presented, which leverages large multimodal models to infer causal and temporal consequences implied by an edit. The method aligns reasoned queries to candidate videos without task-specific finetuning. A new benchmark, CoVR-Reason, is introduced to evaluate reasoning in CoVR.

TII Launches Falcon Perception, A New Multimodal AI Model That Helps Machines See and Understand the World – with Efficiency that Rivals Larger Models

TII · · CV NLP

The Technology Innovation Institute (TII) has launched Falcon Perception, a new 600-million-parameter multimodal AI model. This model offers competitive performance in object segmentation, dense visual understanding, and document intelligence, rivalling larger systems like Meta’s SAM3 and Alibaba’s Qwen with significantly greater efficiency. Falcon Perception unifies image and language processing in a single architecture, designed for real-world deployment in compute-constrained environments. Why it matters: This development positions the UAE among leading nations in advanced multimodal AI, which is crucial for applications in robotics, advanced manufacturing, and autonomous platforms.

Technology Innovation Institute Achieves Fastest Speeds with Vision-based AI Drone Racing

TII · · Robotics RL

Technology Innovation Institute (TII) has developed AI-powered autonomous drones capable of navigating complex environments at speeds up to 80 km/h using only a camera and IMU sensor. The drones use onboard AI-driven visual odometry and reinforcement learning to adapt to their environment in real time. In direct competition, the TII drone set a best lap time of 4.38s, compared to 6.32s and 5.34s for human pilots. Why it matters: This research demonstrates the potential of AI-powered UAVs to surpass human-operated drones in agility and precision, with applications for the transport of goods and potentially people.

TII’s DERC Partners with Brazilian Technology Disruptor Radaz on Airborne Multi-band Interferometric Microwave Imaging Project

TII · · Research Partnership

TII's DERC, in partnership with Brazilian firm RADAZ, has obtained the first microwave images from their joint project on Airborne Multi-band Interferometric Microwave Imaging (A(MI)2) in Abu Dhabi. The project uses a new multiband Synthetic Aperture Radar (SAR) operating in P, L, and C frequency bands to generate terrain images. The system, which can be mounted on commercial drones, also integrates Ground Penetrating Radar capability to detect buried objects. Why it matters: This technology enhances remote sensing capabilities in the region, enabling applications in agriculture, infrastructure monitoring, and search and rescue operations.

Plant diversity predicts resistance to grazing pressure on drylands

KAUST · · Research KAUST

A KAUST-led study in Nature Ecology & Evolution finds that plant species diversity is the strongest predictor of dryland ecosystem resistance to grazing pressure, outperforming climate and soil factors. Analyzing 73 sites across 25 countries, researchers found that diverse plant communities better maintain vegetation cover under grazing. This is attributed to varied species responses distributing grazing pressure and buffering vegetation loss. Why it matters: The findings highlight the importance of biodiversity in maintaining the productivity and stability of dryland ecosystems, which support half of global livestock production and a billion people's livelihoods.

Robust Tightly-Coupled Filter-Based Monocular Visual-Inertial State Estimation and Graph-Based Evaluation for Autonomous Drone Racing

arXiv · · Robotics Research

This paper introduces ADR-VINS, a monocular visual-inertial state estimation framework based on an Error-State Kalman Filter (ESKF) designed for autonomous drone racing, integrating direct pixel reprojection errors from gate corners as innovation terms. It also introduces ADR-FGO, an offline Factor-Graph Optimization framework for generating high-fidelity reference trajectories for post-flight evaluation in GNSS-denied environments. Validated on the TII-RATM dataset, ADR-VINS achieved an average RMS translation error of 0.134 m and was successfully deployed in the A2RL Drone Championship Season 2. Why it matters: The framework provides a robust and efficient solution for drone state estimation in challenging racing environments, and enables performance evaluation without relying on external localization systems.
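The distinctive ingredient is using the gate-corner reprojection error directly as the filter innovation. Below is a minimal sketch of that update step, assuming a pinhole camera, a position-only state, and illustrative numbers (the actual ADR-VINS error-state includes attitude, velocity, and IMU biases):

```python
# Hedged sketch of the core ADR-VINS idea: the pixel reprojection error of a
# known gate corner serves as the Kalman innovation. All values illustrative.
import numpy as np

fx = fy = 400.0; cx = cy = 320.0          # assumed camera intrinsics
corner_w = np.array([1.0, 0.2, 5.0])      # known 3D gate-corner position (world)

def project(p_cam):
    return np.array([fx * p_cam[0] / p_cam[2] + cx,
                     fy * p_cam[1] / p_cam[2] + cy])

x = np.array([0.05, 0.0, 0.0])            # state: drone position (identity attitude)
P = np.eye(3) * 0.5                       # state covariance
R = np.eye(2) * 2.0                       # pixel-noise covariance
z = np.array([398.0, 336.5])              # detected corner pixel

h = lambda pos: project(corner_w - pos)   # measurement model
H = np.zeros((2, 3)); eps = 1e-5          # numerical Jacobian dh/dx
for j in range(3):
    d = np.zeros(3); d[j] = eps
    H[:, j] = (h(x + d) - h(x - d)) / (2 * eps)

nu = z - h(x)                             # innovation = reprojection error
S = H @ P @ H.T + R
K = P @ H.T @ np.linalg.inv(S)
x = x + K @ nu                            # corrected position estimate
P = (np.eye(3) - K @ H) @ P
print(x)
```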

KAUST becomes first FIFA research institute in the Middle East and Asia

KAUST · · Partnership Research

KAUST has been selected as the first FIFA Research Institute in the Middle East and Asia. KAUST will apply its research expertise to advance football-related studies, initially focusing on developing datasets that enable deeper insights into the game. The collaboration’s first project focuses on developing AI algorithms to analyze historical FIFA World Cup broadcast footage, while the second project leverages player and ball tracking data from the FIFA World Cup 2022™ Qatar and the FIFA Women’s World Cup 2023™ Australia & New Zealand. Why it matters: This partnership strengthens the intersection of sport, academia, and industry in the region through high-impact scientific inquiry.

DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding

arXiv · · CV NLP

MBZUAI researchers introduce DuwatBench, a new benchmark for multimodal understanding of Arabic calligraphy. The dataset contains 1,272 samples across six calligraphic styles with detailed annotations to evaluate visual-text alignment. Evaluation of 13 multimodal models reveals challenges in processing calligraphic variations and artistic distortions, highlighting the need for culturally grounded AI research.

MonoRace: Winning Champion-Level Drone Racing with Robust Monocular AI

arXiv · · Robotics RL

The paper presents MonoRace, an onboard drone racing approach using a monocular camera and IMU. The system combines neural-network-based gate segmentation with a drone model for robust state estimation, along with offline optimization using gate geometry. MonoRace won the 2025 Abu Dhabi Autonomous Drone Racing Competition (A2RL), outperforming AI teams and human world champions, reaching speeds up to 100 km/h. Why it matters: This demonstrates a significant advancement in autonomous drone racing, achieving champion-level performance with a resource-efficient monocular system, validated in a real-world competition setting in the UAE.

Hybrid Deep Feature Extraction and ML for Construction and Demolition Debris Classification

arXiv · · CV Research

This paper introduces a hybrid deep learning and machine learning pipeline for classifying construction and demolition waste. A dataset of 1,800 images from UAE construction sites was created, and deep features were extracted using a pre-trained Xception network. The combination of Xception features with machine learning classifiers achieved up to 99.5% accuracy, demonstrating state-of-the-art performance for debris identification.
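The described pipeline, frozen pretrained CNN features feeding a classical classifier, is straightforward to sketch. The SVC choice and the stand-in data below are ours, not the paper's exact configuration:

```python
# Plausible reconstruction of the hybrid pipeline: deep features from a frozen,
# ImageNet-pretrained Xception feed a classical machine-learning classifier.
import numpy as np
from tensorflow.keras.applications import Xception
from tensorflow.keras.applications.xception import preprocess_input
from sklearn.svm import SVC

extractor = Xception(weights="imagenet", include_top=False, pooling="avg")  # 2048-d

def deep_features(images):                # images: (n, 299, 299, 3) float arrays
    return extractor.predict(preprocess_input(images), verbose=0)

# Illustrative stand-in data; the paper's dataset has 1,800 UAE site images.
X_train = np.random.rand(8, 299, 299, 3) * 255.0
y_train = np.array([0, 1, 2, 0, 1, 2, 0, 1])   # hypothetical debris classes

clf = SVC(kernel="rbf").fit(deep_features(X_train), y_train)
print(clf.predict(deep_features(X_train[:2])))
```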

Drift-Corrected Monocular VIO and Perception-Aware Planning for Autonomous Drone Racing

arXiv · · Robotics RL

This paper details the autonomous drone racing system developed for the Abu Dhabi Autonomous Racing League (A2RL) x Drone Champions League competition. The system uses drift-corrected monocular Visual-Inertial Odometry (VIO) fused with YOLO-based gate detection for global position measurements, managed via a Kalman filter. A perception-aware planner generates trajectories balancing speed and gate visibility. Why it matters: The system's podium finishes validate the effectiveness of monocular vision-based autonomous drone flight and showcase advancements in AI-powered robotics within the UAE.
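The drift-correction idea reduces to a standard predict/correct loop: VIO increments drive the prediction and accumulate drift, while gate-based global fixes pull the estimate back. A deliberately simplified 1-D sketch, with all noise values illustrative:

```python
# Minimal sketch (assumptions ours): VIO increments drive the Kalman prediction;
# occasional gate-based absolute position fixes correct the accumulated drift.
import numpy as np

x, P = 0.0, 0.01        # 1-D position estimate and variance
Q, R = 0.05, 0.5        # VIO drift noise, gate-detection measurement noise

def predict(dx_vio):
    global x, P
    x += dx_vio          # integrate the (drifting) VIO increment
    P += Q

def correct(z_gate):
    global x, P
    K = P / (P + R)      # Kalman gain
    x += K * (z_gate - x)
    P *= (1 - K)

truth = 0.0
for step in range(20):
    truth += 1.0
    predict(1.0 + 0.08)                                 # increment with drift
    if step % 5 == 4:
        correct(truth + np.random.normal(0, 0.2))       # occasional gate fix
print(f"truth={truth:.1f}  estimate={x:.2f}")
```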

Synthetic data can accurately track environmental disasters

KAUST · · Research Partnership

KAUST and SARsatX have developed a method using Generative Adversarial Networks (GANs) to generate synthetic SAR imagery for training deep learning models to detect oil spills. Starting with just 17 real SAR images, they generated over 2,000 synthetic images to train a Multi-Attention Network (MANet) model. The MANet model, trained exclusively on synthetic data, achieved 75% accuracy in identifying oil spill areas, matching the performance of models trained on larger real datasets. Why it matters: This advancement enables faster and more reliable environmental monitoring using AI, even when real-world data is scarce, reducing the need to wait for actual disasters to occur.

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

arXiv · · CV Research

The paper introduces the Prism Hypothesis, which posits a correspondence between an encoder's feature spectrum and its functional role, with semantic encoders capturing low-frequency components and pixel encoders retaining high-frequency information. Based on this, the authors propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details using a frequency-band modulator. Experiments on ImageNet and MS-COCO demonstrate that UAE effectively unifies semantic abstraction and pixel-level fidelity, achieving state-of-the-art performance.
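The low-/high-frequency decomposition at the heart of the hypothesis can be illustrated with a simple spectral split; the cutoff and masking scheme below are our illustration, not UAE's actual frequency-band modulator:

```python
# Sketch of a low/high frequency-band split: semantic encoders are associated
# with the low band, pixel encoders with the high band. Cutoff choice is ours.
import torch

def split_frequency_bands(image, cutoff=0.15):
    F_img = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))  # centered spectrum
    H, W = image.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-0.5, 0.5, H),
                            torch.linspace(-0.5, 0.5, W), indexing="ij")
    low_mask = ((xx ** 2 + yy ** 2).sqrt() <= cutoff).to(F_img.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(F_img * low_mask, dim=(-2, -1))).real
    return low, image - low                              # low band, high band

img = torch.randn(1, 3, 64, 64)
low, high = split_frequency_bands(img)
print(low.shape, high.abs().mean().item())
```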

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv · · CV NLP

A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length and multimodal richness, using human-validated samples. LongShOTAgent, an agentic system, is also presented for analyzing long videos, with both the benchmark and agent demonstrating the challenges faced by state-of-the-art MLLMs.

OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving

arXiv · · CV RL

The paper introduces OmniGen, a unified framework for generating aligned multimodal sensor data for autonomous driving using a shared Bird's Eye View (BEV) space. It uses a novel generalizable multimodal reconstruction method (UAE) to jointly decode LiDAR and multi-view camera data through volume rendering. The framework incorporates a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation, demonstrating good performance and multimodal consistency.

From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region

arXiv · · CV Research

A new study compares vision-language models (VLMs) with YOLOv8 for wastewater treatment plant (WWTP) identification in satellite imagery across the MENA region. VLMs such as Gemma-3 demonstrate superior zero-shot performance compared to a YOLOv8 model trained on a dataset of 83,566 satellite images from Egypt, Saudi Arabia, and the UAE. The research suggests VLMs offer a scalable, annotation-free alternative for remote sensing of WWTPs.

Video-CoM: Interactive Video Reasoning via Chain of Manipulations

arXiv · · CV RL

Researchers at MBZUAI introduce "Interactive Video Reasoning," a new paradigm enabling models to actively "think with videos" by performing iterative visual actions to gather and refine evidence. They developed Video-CoM, which reasons through a Chain of Manipulations (CoM), and constructed Video-CoM Instruct, an 18K-sample instruction-tuning dataset for multi-step manipulation reasoning. The model is further optimized via reinforcement learning with reasoning-aware Group Relative Policy Optimization (GRPO), achieving strong results across nine video reasoning benchmarks.
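GRPO, used here and in several entries below, normalizes rewards within a group of responses sampled for the same prompt, and the normalized score serves as the advantage. A minimal sketch of that core computation:

```python
# Group-relative advantage at the heart of GRPO: rewards for a group of sampled
# responses to the same prompt are normalized within the group (no value network).
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    r = np.asarray(group_rewards, float)
    return (r - r.mean()) / (r.std() + eps)   # per-group baseline and scale

print(grpo_advantages([1.0, 0.0, 0.5, 1.0]))  # above-average answers get positive advantage
```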

Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models

arXiv · · CV RL

Researchers at MBZUAI have introduced Video-R2, a reinforcement learning approach to improve the consistency and visual grounding of reasoning in multimodal language models. Video-R2 combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a Temporal Alignment Reward (TAR). The model demonstrates higher Think Answer Consistency (TAC), Video Attention Score (VAS), and accuracy across multiple benchmarks, showing improved temporal alignment and reasoning coherence for video understanding.
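The summary does not give TAR's exact form; one plausible reading, sketched here purely as an assumption, is a reward proportional to the temporal overlap (IoU) between the timestamps the model cites in its reasoning and the ground-truth evidence interval:

```python
# Hypothetical rendering of a temporal-alignment reward: temporal IoU between
# the interval cited by the model and the ground-truth evidence interval.
def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((12.0, 18.0), (15.0, 20.0)))  # 0.375: partial temporal grounding
```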

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

arXiv · · LLM CV

Researchers at MBZUAI have introduced EvoLMM, a self-evolving framework for large multimodal models that enhances reasoning capabilities without human-annotated data or reward distillation. EvoLMM uses two cooperative agents, a Proposer and a Solver, which generate image-grounded questions and solve them through internal consistency, using a continuous self-rewarding process. Evaluations using Qwen2.5-VL as the base model showed performance gains of up to 3% on multimodal math-reasoning benchmarks like ChartQA, MathVista, and MathVision using only raw training images.
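The continuous self-reward can be pictured as agreement among repeated Solver attempts; the majority-fraction formulation below is our assumption of how such a label-free signal could look, not EvoLMM's published formula:

```python
# Hedged sketch of a continuous self-reward from internal consistency: the Solver
# answers the Proposer's question several times, and agreement among the samples
# is the reward signal (no human labels involved).
from collections import Counter

def consistency_reward(solver_answers):
    counts = Counter(solver_answers)
    return counts.most_common(1)[0][1] / len(solver_answers)  # majority fraction

samples = ["42", "42", "41", "42"]        # repeated Solver outputs for one question
print(consistency_reward(samples))        # 0.75 -> continuous, label-free reward
```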

MMRINet: Efficient Mamba-Based Segmentation with Dual-Path Refinement for Low-Resource MRI Analysis

arXiv · · Research Healthcare

Researchers from MBZUAI have developed MMRINet, a Mamba-based neural network for efficient brain tumor segmentation in MRI scans. The model uses Dual-Path Feature Refinement and Progressive Feature Aggregation to achieve high accuracy with only 2.5M parameters, making it suitable for low-resource clinical environments. MMRINet achieves a Dice score of 0.752 and HD95 of 12.23 on the BraTS-Lighthouse SSA 2025 benchmark.

Climate Adaptation-Aware Flood Prediction for Coastal Cities Using Deep Learning

arXiv · · Research CV

Researchers have developed a CNN-based deep learning model for predicting coastal flooding in cities under various sea-level rise scenarios. The model utilizes a vision-based, low-resource DL framework and is trained on datasets from Abu Dhabi and San Francisco. Results show a 20% reduction in mean absolute error compared to existing methods, demonstrating potential for scalable coastal flood management.

MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

arXiv · · CV LLM

Researchers introduce MATRIX, a vision-centric agent tuning framework for robust tool-use reasoning in VLMs. The framework includes M-TRACE, a dataset of 28.5K multimodal tasks with 177K verified trajectories, and Pref-X, a set of 11K automatically generated preference pairs. Experiments show MATRIX consistently outperforms open- and closed-source VLMs across three benchmarks.

Gender Stereotypes in Professional Roles Among Saudis: An Analytical Study of AI-Generated Images Using Language Models

arXiv · · Research Ethics

The study analyzes over 1,000 images generated by ImageFX, DALL-E V3, and Grok for 56 Saudi professions, finding significant gender imbalances and cultural inaccuracies. DALL-E V3 exhibited the strongest gender stereotyping, with 96% male depictions, particularly in leadership and technical roles. The research underscores the need for diverse training data and culturally sensitive evaluation to ensure equitable AI outputs that accurately reflect Saudi Arabia's labor market and culture.

SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation

arXiv · · CV NLP

Researchers from MBZUAI have introduced SPECS, a new reference-free evaluation metric for long image captions that modifies CLIP to emphasize specificity. SPECS aims to improve the correlation with human judgment while maintaining computational efficiency compared to LLM-based metrics. The proposed approach is intended for iterative use during image captioning model development, offering a practical alternative to existing methods.
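SPECS's specificity weighting is not detailed in the summary, but the reference-free CLIP-similarity backbone it modifies looks like the following (standard Hugging Face CLIP API; the checkpoint choice and placeholder image are ours):

```python
# Reference-free CLIP similarity between an image and a candidate long caption,
# the base score that SPECS adjusts to reward specificity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    inputs = proc(text=[caption], images=image, return_tensors="pt",
                  padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())           # cosine similarity in CLIP space

image = Image.new("RGB", (224, 224), "gray")  # placeholder image
print(clip_similarity(image, "a long, highly specific caption of the scene"))
```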

Continuous Saudi Sign Language Recognition: A Vision Transformer Approach

arXiv · · NLP CV

The researchers introduce KAU-CSSL, the first continuous Saudi Sign Language (SSL) dataset focusing on complete sentences. They propose a transformer-based model using ResNet-18 for spatial feature extraction and a Transformer Encoder with Bidirectional LSTM for temporal dependencies. The model achieved 99.02% accuracy in signer-dependent mode and 77.71% in signer-independent mode, advancing communication tools for the SSL community.
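The described architecture maps naturally to a compact PyTorch module; all layer sizes below are illustrative rather than the paper's configuration:

```python
# Hedged sketch of the described stack: per-frame ResNet-18 features, a
# Transformer encoder over time, then a bidirectional LSTM and classifier head.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CSSLModel(nn.Module):
    def __init__(self, num_classes, d_model=512):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # 512-d per frame
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.bilstm = nn.LSTM(d_model, 256, bidirectional=True, batch_first=True)
        self.head = nn.Linear(512, num_classes)

    def forward(self, video):                          # video: (B, T, 3, 224, 224)
        B, T = video.shape[:2]
        f = self.cnn(video.flatten(0, 1)).flatten(1)   # (B*T, 512) spatial features
        f = self.encoder(f.view(B, T, -1))             # temporal self-attention
        h, _ = self.bilstm(f)                          # (B, T, 512)
        return self.head(h.mean(dim=1))                # sentence-level prediction

model = CSSLModel(num_classes=40)
print(model(torch.randn(2, 8, 3, 224, 224)).shape)    # torch.Size([2, 40])
```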

Adoption of AI to accelerate world's largest coral restoration project

KAUST · · Partnership Research

KAUST is partnering with digiLab to develop AI for coral conservation within the KAUST Coral Restoration Initiative (KCRI). digiLab's AI platform will provide real-time simulations of the 100-hectare reefscape, aiding in understanding coral resilience and growth under changing conditions. The AI tools are expected to reduce coral assessment times from months to weeks and optimize sensor placement. Why it matters: This partnership sets a new standard for coral restoration by demonstrating a scalable AI-driven model for global conservation efforts.

Deep learning accelerates research on early pregnancies

KAUST · · Research Healthcare

KAUST researchers have developed deepBlastoid, a deep learning tool for evaluating models of human embryo development called blastoids. deepBlastoid evaluates blastoid images 1,000 times faster than expert scientists, processing 273 images per second. Trained on more than 2,000 microscope images of blastoids, the tool was used to assess the impact of chemicals on blastoid development across a further 10,000-plus images. Why it matters: This AI tool accelerates research into early pregnancy, fertility complications, and the impact of chemicals on embryo development, with implications for reproductive technologies.

MIRA: A Novel Framework for Fusing Modalities in Medical RAG

arXiv · · Research Healthcare

MBZUAI researchers have introduced MIRA, a novel framework for improving the factual accuracy of multimodal large language models in medical applications. MIRA uses calibrated retrieval to manage factual risk and integrates image embeddings with a medical knowledge base for efficient reasoning. Evaluated on medical VQA and report generation benchmarks, MIRA achieves state-of-the-art results, with code available on GitHub.

ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models

arXiv · · CV Research

The paper introduces ScoreAdv, a novel approach for generating natural unrestricted adversarial examples (UAEs) using diffusion models. It incorporates an adversarial guidance mechanism and saliency maps to shift the sampling distribution and inject visual information. Experiments on the ImageNet and CelebA datasets demonstrate state-of-the-art attack success rates, image quality, and robustness against defenses.

MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering

arXiv · · Research NLP

This paper introduces MOTOR, a multimodal retrieval and re-ranking approach for medical visual question answering (MedVQA) that uses grounded captions and optimal transport to capture relationships between queries and retrieved context, leveraging both textual and visual information. MOTOR identifies clinically relevant contexts to augment VLM input, achieving higher accuracy on MedVQA datasets. Empirical analysis shows MOTOR outperforms state-of-the-art methods by an average of 6.45%.
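The optimal-transport scoring can be sketched with the POT library: each retrieved context is scored by the entropic (Sinkhorn) transport cost between its features and the query's. This is our simplification of MOTOR's grounded-caption formulation:

```python
# Hedged sketch of OT re-ranking: lower transport cost between feature sets
# means a better query-context match.
import numpy as np
import ot  # POT: Python Optimal Transport

def ot_score(query_feats, ctx_feats, reg=0.1):
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    c = ctx_feats / np.linalg.norm(ctx_feats, axis=1, keepdims=True)
    M = 1.0 - q @ c.T                                   # cosine cost matrix
    a = np.full(len(q), 1.0 / len(q))                   # uniform marginals
    b = np.full(len(c), 1.0 / len(c))
    plan = ot.sinkhorn(a, b, M, reg)                    # entropic OT plan
    return float((plan * M).sum())                      # transport cost

rng = np.random.default_rng(0)
query = rng.normal(size=(5, 64))                        # query-side features
contexts = [rng.normal(size=(7, 64)), query + 0.01]     # 2nd context ~ the query
print([round(ot_score(query, c), 3) for c in contexts]) # lower score = better match
```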

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

arXiv · · NLP LLM

A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark includes 8k manually verified samples across 15 categories and varying video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples, with both to be publicly released.

TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation

arXiv · · Research CV

MBZUAI researchers introduce TerraFM, a scalable self-supervised learning model for Earth observation that uses Sentinel-1 and Sentinel-2 imagery. The model unifies radar and optical inputs through modality-specific patch embeddings and adaptive cross-attention fusion. TerraFM achieves strong generalization on classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.

VideoMolmo: Spatio-Temporal Grounding Meets Pointing

arXiv · · CV LLM

Researchers from MBZUAI have introduced VideoMolmo, a large multimodal model for spatio-temporal pointing conditioned on textual descriptions. The model incorporates a temporal module with an attention mechanism and a temporal mask fusion pipeline using SAM2 for improved coherence across video sequences. They also curated a dataset of 72k video-caption pairs and introduced VPoS-Bench, a benchmark for evaluating generalization across real-world scenarios, with code and models publicly available.

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

arXiv · · Research CV

MBZUAI researchers introduce VideoMathQA, a new benchmark for evaluating mathematical reasoning in videos, requiring models to interpret visual information, text, and spoken cues. The dataset spans 10 mathematical domains with videos ranging from 10 seconds to over 1 hour, and includes multi-step reasoning annotations. The benchmark aims to evaluate temporal cross-modal reasoning and highlights the limitations of existing approaches in complex video-based mathematical problem solving.

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

arXiv · · Research CV

MBZUAI introduces Agent-X, a benchmark for evaluating multi-step reasoning in vision-centric agents across real-world, multimodal settings. Agent-X includes 828 tasks with diverse visual contexts and spans six environments, requiring tool use and stepwise decision-making. Experiments show that current LMMs struggle with multi-step vision tasks, achieving less than 50% success, highlighting areas for improvement in LMM reasoning and tool use.

ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark

arXiv · · Research Arabic AI

MBZUAI researchers introduce ARB, the first comprehensive benchmark for evaluating step-by-step multimodal reasoning in Arabic across textual and visual modalities. The benchmark spans 11 diverse domains and includes 1,356 multimodal samples with 5,119 human-curated reasoning steps. Evaluations of 12 state-of-the-art LMMs revealed challenges in coherence, faithfulness, and cultural grounding, highlighting the need for culturally aware AI systems.

Optimization of Module Transferability in Single Image Super-Resolution: Universality Assessment and Cycle Residual Blocks

arXiv · · CV Research

This paper introduces a method for quantifying the transferability of architectural components in Single Image Super-Resolution (SISR) models, termed "Universality," and proposes a Universality Assessment Equation (UAE). Guided by the UAE, the authors design optimized modules, Cycle Residual Block (CRB) and Depth-Wise Cycle Residual Block (DCRB), and demonstrate their effectiveness across various datasets and low-level tasks. Results show that networks using these modules outperform state-of-the-art methods, achieving improved PSNR or parameter reduction.

Genetic secrets of rice pave way for future farming and conservation

KAUST · · Research Healthcare

KAUST researchers have published a study in Nature Genetics detailing genomic analysis of wild rice relatives. The study examined nine tetraploid and two diploid wild relatives of rice, finding significant genetic diversity due to transposable elements. This diversity includes genes that confer resilience to heat, drought, and salinity. Why it matters: These findings can help improve rice yields, introduce rice cultivation to currently untenable regions, and protect rice crops against climate change, especially in the Middle East.

MedNNS: Supernet-based Medical Task-Adaptive Neural Network Search

arXiv · · Research Healthcare

The paper introduces MedNNS, a neural network search framework designed for medical imaging, addressing challenges in architecture selection and weight initialization. MedNNS constructs a meta-space encoding datasets and models based on their performance using a Supernetwork-based approach, expanding the model zoo size by 51x. The framework incorporates rank loss and Fréchet Inception Distance (FID) loss to capture inter-model and inter-dataset relationships, improving alignment in the meta-space and outperforming ImageNet pre-trained DL models and SOTA NAS methods.

SemDiff: Generating Natural Unrestricted Adversarial Examples via Semantic Attributes Optimization in Diffusion Models

arXiv · · CV Research

This paper introduces SemDiff, a novel method for generating unrestricted adversarial examples (UAEs) by exploring the semantic latent space of diffusion models. SemDiff uses multi-attribute optimization to ensure attack success while preserving the naturalness and imperceptibility of generated UAEs. Experiments on high-resolution datasets demonstrate SemDiff's superior performance compared to state-of-the-art methods in attack success rate and imperceptibility, while also evading defenses.

RP-SAM2: Refining Point Prompts for Stable Surgical Instrument Segmentation

arXiv · · CV Research

Researchers from MBZUAI introduced RP-SAM2, a method to improve surgical instrument segmentation by refining point prompts for more stable results. RP-SAM2 uses a novel shift block and compound loss function to reduce sensitivity to point prompt placement, improving segmentation accuracy in data-constrained settings. Experiments on the Cataract1k and CaDIS datasets show that RP-SAM2 enhances segmentation accuracy and reduces variance compared to SAM2, with code available on GitHub.

KAUST scientists see the first steps of life in DNA unwinding

KAUST · · Research Healthcare

KAUST researchers have captured the initial unwinding of DNA using cryo-electron microscopy and deep learning. The study details 15 atomic states describing how the Simian Virus 40 Large Tumor Antigen helicase unwinds DNA, revealing the coordinated roles of DNA, helicases, and ATP. The research elucidates the fundamental mechanisms of DNA replication, a cornerstone of growth and reproduction. Why it matters: This detailed understanding of helicase function could lead to advances in nanotechnology and our understanding of genetic processes.

SALT: Parameter-Efficient Fine-Tuning via Singular Value Adaptation with Low-Rank Transformation

arXiv · · Research Healthcare

Researchers introduce SALT, a parameter-efficient fine-tuning method for medical image segmentation that combines singular value adaptation with low-rank transformation. SALT selectively adapts influential singular values and complements this with a low-rank update for the remaining subspace. Experiments on five medical datasets show SALT outperforms state-of-the-art PEFT methods by 2-5% in Dice score with only 3.9% trainable parameters.
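The mechanism is concrete enough to sketch: SVD the pretrained weight, learn a scale and shift on the leading singular values, and add a LoRA-style low-rank term for the remaining subspace. The parameterization details below are our reading, not the official code:

```python
# Hedged sketch of the SALT idea: adapt the top singular values of a frozen
# pretrained weight and add a low-rank residual; only those parts train.
import torch
import torch.nn as nn

class SALTLinear(nn.Module):
    def __init__(self, weight: torch.Tensor, top_k=8, rank=4):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.U, self.S, self.Vh = (nn.Parameter(t, requires_grad=False)
                                   for t in (U, S, Vh))
        self.k = top_k
        self.scale = nn.Parameter(torch.ones(top_k))   # adapt leading singular values
        self.shift = nn.Parameter(torch.zeros(top_k))
        self.A = nn.Parameter(torch.zeros(weight.shape[0], rank))  # low-rank residual
        self.B = nn.Parameter(torch.randn(rank, weight.shape[1]) * 0.01)

    def forward(self, x):
        S_adapted = self.S.clone()
        S_adapted[: self.k] = self.S[: self.k] * self.scale + self.shift
        W = self.U @ torch.diag(S_adapted) @ self.Vh + self.A @ self.B
        return x @ W.T

layer = SALTLinear(torch.randn(64, 32))
print(layer(torch.randn(2, 32)).shape)  # torch.Size([2, 64])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the scale/shift and low-rank factors train
```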

Tracking Meets Large Multimodal Models for Driving Scenario Understanding

arXiv · · LLM CV

Researchers at MBZUAI have introduced a novel approach to enhance Large Multimodal Models (LMMs) for autonomous driving by integrating 3D tracking information. This method uses a track encoder to embed spatial and temporal data, enriching visual queries and improving the LMM's understanding of driving scenarios. Experiments on DriveLM-nuScenes and DriveLM-CARLA benchmarks demonstrate significant improvements in perception, planning, and prediction tasks compared to baseline models.

Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts

arXiv · · Research CV

Researchers introduce TimeTravel, a benchmark dataset for evaluating large multimodal models (LMMs) on historical and cultural artifacts. The benchmark comprises 10,250 expert-verified samples across 266 cultures and 10 historical regions, designed to assess AI in tasks like classification and interpretation of manuscripts, artworks, inscriptions, and archaeological discoveries. The goal is to establish AI as a reliable partner in preserving cultural heritage and assisting researchers.

Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization

arXiv · · Research CV

This paper introduces Adaptive Entropy-aware Optimization (AEO), a new framework to tackle Multimodal Open-set Test-time Adaptation (MM-OSTTA). AEO uses Unknown-aware Adaptive Entropy Optimization (UAE) and Adaptive Modality Prediction Discrepancy Optimization (AMP) to distinguish unknown class samples during online adaptation by amplifying the entropy difference between known and unknown samples. The study establishes a new benchmark derived from existing datasets with five modalities and evaluates AEO's performance across various domain shift scenarios, demonstrating its effectiveness in long-term and continual MM-OSTTA settings.
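The entropy-gap intuition can be sketched as a loss that sharpens predictions on confident (likely known) samples while flattening them on uncertain (likely unknown) ones; the thresholded weighting below is illustrative, not AEO's exact scheme:

```python
# Illustrative entropy-separation loss: minimize entropy for confident samples,
# maximize it for uncertain ones, widening the known/unknown gap.
import torch
import torch.nn.functional as F

def aeo_style_loss(logits, low=0.3, high=0.7):
    probs = F.softmax(logits, dim=1)
    ent = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    ent_norm = ent / torch.log(torch.tensor(float(logits.shape[1])))  # in [0, 1]
    known = ent_norm < low                        # confident -> treat as known
    unknown = ent_norm > high                     # uncertain -> treat as unknown
    loss = ent[known].sum() - ent[unknown].sum()  # sharpen known, flatten unknown
    return loss / max(int(known.sum() + unknown.sum()), 1)

logits = torch.randn(16, 10)
print(aeo_style_loss(logits))
```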

VENOM: Text-driven Unrestricted Adversarial Example Generation with Diffusion Models

arXiv · · Research CV

The paper introduces VENOM, a text-driven framework for generating high-quality unrestricted adversarial examples using diffusion models. VENOM unifies image content generation and adversarial synthesis into a single reverse diffusion process, enhancing both attack success rate and image quality. The framework incorporates an adaptive adversarial guidance strategy with momentum to ensure the generated adversarial examples align with the distribution of natural images.
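As a toy rendering of momentum-accumulated adversarial guidance (the real VENOM operates inside a diffusion model's reverse process; the "denoising" step and victim classifier below are stand-ins of ours):

```python
# Toy sketch: at each simplified reverse step, the gradient pushing toward the
# target class is blended into a momentum buffer before nudging the sample.
import torch
import torch.nn.functional as F

classifier = torch.nn.Linear(3 * 8 * 8, 10)   # stand-in victim classifier
target = torch.tensor([3])

x = torch.randn(1, 3, 8, 8)                   # current noisy sample
momentum = torch.zeros_like(x)
mu, guidance_scale = 0.9, 0.5

for t in range(10):                           # simplified reverse process
    x = x.detach().requires_grad_(True)
    loss = F.cross_entropy(classifier(x.flatten(1)), target)
    grad = torch.autograd.grad(loss, x)[0]
    momentum = mu * momentum + grad / grad.abs().mean()  # momentum-smoothed guidance
    x = 0.9 * x.detach()                      # stand-in denoising drift toward data
    x = x - guidance_scale * momentum         # steer toward the adversarial target
print(float(F.softmax(classifier(x.flatten(1)), dim=1)[0, 3]))  # target-class prob
```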