MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering

arXiv · June 28, 2025 · Significant research

Summary

This paper introduces MOTOR, a multimodal retrieval and re-ranking approach for medical visual question answering (MedVQA). MOTOR uses grounded captions and optimal transport to capture the relationships between a query and its retrieved context, drawing on both textual and visual information. It selects clinically relevant contexts to augment the input of a vision-language model (VLM), which raises answer accuracy on MedVQA datasets. In the reported experiments, MOTOR outperforms state-of-the-art methods by an average of 6.45%.
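The summary does not say how MOTOR formulates its optimal transport step. As a generic illustration only, and not the authors' implementation, the sketch below re-ranks retrieved contexts by an entropy-regularized optimal transport (Sinkhorn) cost between the query's set of embeddings and each candidate's set of embeddings. The cosine cost, uniform weights, `eps` value, toy embeddings, and the names `sinkhorn_cost` and `rerank` are all hypothetical assumptions.

```python
import math


def cosine_cost(a, b):
    """Ground cost between two embeddings: 1 - cosine similarity (range 0..2)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def sinkhorn_cost(query, candidate, eps=0.1, iters=200):
    """Entropic OT cost between two embedding sets, assuming uniform mass on each element."""
    n, m = len(query), len(candidate)
    C = [[cosine_cost(q, c) for c in candidate] for q in query]
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]  # Gibbs kernel
    a = [1.0 / n] * n  # uniform marginal over query elements
    b = [1.0 / m] * m  # uniform marginal over candidate elements
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate scaling so the plan's row sums match a and column sums match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P_ij = u_i * K_ij * v_j; return the total cost <P, C>.
    return sum(u[i] * K[i][j] * v[j] * C[i][j] for i in range(n) for j in range(m))


def rerank(query, candidates, eps=0.1):
    """Return candidate ids sorted by ascending OT cost (lower = better aligned)."""
    scores = {cid: sinkhorn_cost(query, emb, eps) for cid, emb in candidates.items()}
    return sorted(scores, key=scores.get)
```

For example, with a query of two embeddings (say, one visual and one textual), a candidate whose elements each closely match one query element receives a lower transport cost than a candidate whose elements all resemble only one of them, so it is ranked first. Using a set-to-set transport cost rather than a single pooled similarity is what lets the ranking reflect how individual query parts align with individual context parts.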

Keywords

MedVQA · visual question answering · retrieval-augmented generation · optimal transport · vision-language models



Related

New approach for better AI analysis of medical images presented at MICCAI

MBZUAI

MBZUAI researchers developed Multimodal Optimal Transport via Grounded Retrieval (MOTOR), an approach that improves the accuracy of vision-language models in medical image analysis. MOTOR combines retrieval-augmented generation (RAG) with an optimal transport algorithm to retrieve and rank relevant image and text data. On two medical datasets, MOTOR improved average performance by 6.45%.

Why it matters: The technique addresses two obstacles to training AI models for medical image interpretation, the scarcity of specialized medical datasets and the high computational cost of training, and offers a more efficient and accurate alternative.
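The article describes the retrieval side only at a high level. The sketch below illustrates a generic first-stage RAG step under assumed, precomputed embeddings: score each knowledge-base entry by a weighted mix of image-to-image and text-to-text cosine similarity, keep the top k, and prepend their captions to the question as extra VLM input. `Entry`, `retrieve`, `build_prompt`, the weight `alpha`, and the prompt layout are hypothetical; MOTOR additionally re-ranks the retrieved set with optimal transport and grounded captions, which this sketch omits.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical knowledge-base item: an image-caption pair with precomputed embeddings."""
    caption: str
    image_emb: list[float]
    text_emb: list[float]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(q_image, q_text, kb, k=2, alpha=0.5):
    """Rank entries by alpha * image similarity + (1 - alpha) * text similarity; keep top k."""
    scored = [
        (alpha * cosine(q_image, e.image_emb) + (1 - alpha) * cosine(q_text, e.text_emb), e)
        for e in kb
    ]
    scored.sort(key=lambda s: s[0], reverse=True)  # sort on score only
    return [e for _, e in scored[:k]]


def build_prompt(question, contexts):
    """Prepend the retrieved captions to the question as additional context."""
    lines = [f"Context {i + 1}: {e.caption}" for i, e in enumerate(contexts)]
    return "\n".join(lines + [f"Question: {question}"])
```

Keeping the image and text similarities as separate terms, rather than one fused embedding, makes the balance between visual and textual evidence an explicit, tunable choice.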
