GCC AI Research

Results for "image retrieval"

MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering

arXiv ·

This paper introduces MOTOR, a multimodal retrieval and re-ranking approach for medical visual question answering (MedVQA) that uses grounded captions and optimal transport to capture relationships between queries and retrieved context, leveraging both textual and visual information. MOTOR identifies clinically relevant contexts to augment VLM input, achieving higher accuracy on MedVQA datasets. Empirical analysis shows MOTOR outperforms state-of-the-art methods by an average of 6.45%.
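As a rough illustration of how optimal transport can score a retrieved context against a query, the sketch below computes an entropy-regularised transport cost between two sets of feature vectors and sorts candidates by that cost. This is not MOTOR's implementation: the function names (`sinkhorn_plan`, `ot_distance`, `rerank`), the cosine cost, the uniform weights, and the `reg` value are all assumptions made for the example.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularised transport plan between two uniform distributions."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform mass on query features (assumption)
    b = np.full(m, 1.0 / m)          # uniform mass on candidate features (assumption)
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):         # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def ot_distance(query_feats, cand_feats, reg=0.1):
    """Transport cost between two feature sets under a cosine-distance cost."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    c = cand_feats / np.linalg.norm(cand_feats, axis=1, keepdims=True)
    cost = 1.0 - q @ c.T
    plan = sinkhorn_plan(cost, reg)
    return float((plan * cost).sum())

def rerank(query_feats, candidates, reg=0.1):
    """Return candidate indices ordered from lowest to highest transport cost."""
    scores = [ot_distance(query_feats, c, reg) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i])
```

Here each row of `query_feats` or of a candidate array stands for one token or region embedding; in a real system these would come from text and image encoders, which the sketch leaves out.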

Retrieval Augmentation as a Shortcut to the Training Data

MBZUAI ·

This article discusses retrieval augmentation in text generation, where information retrieved from an external source is used to condition predictions. It references recent work on retrieval-augmented image captioning, showing that model size can be greatly reduced when training data is available through retrieval. The author intends to continue this work in two directions: the intersection of retrieval augmentation and in-context learning, and controllable image captioning for language-learning materials. Why it matters: This research direction has the potential to improve transfer learning in vision-language models, which could be especially relevant for downstream applications in Arabic NLP and multimodal tasks.

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

arXiv ·

The paper introduces the Prism Hypothesis, which posits a correspondence between an encoder's feature spectrum and its functional role, with semantic encoders capturing low-frequency components and pixel encoders retaining high-frequency information. Based on this, the authors propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details using a frequency-band modulator. Experiments on ImageNet and MS-COCO demonstrate that UAE effectively unifies semantic abstraction and pixel-level fidelity, achieving state-of-the-art performance.
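To make the low-frequency versus high-frequency distinction concrete, the sketch below splits a grayscale image into two bands with a radial mask in the Fourier domain. It only illustrates band splitting and is not the paper's frequency-band modulator; the function name `split_frequency_bands` and the `cutoff` value are assumptions for the example.

```python
import numpy as np

def split_frequency_bands(image, cutoff=0.1):
    """Split a 2-D array into low- and high-frequency parts that sum to the input."""
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]       # vertical frequencies in cycles per pixel
    fx = np.fft.fftfreq(w)[None, :]       # horizontal frequencies in cycles per pixel
    radius = np.sqrt(fx ** 2 + fy ** 2)
    low_mask = radius <= cutoff           # keep frequencies near the origin
    spectrum = np.fft.fft2(image)
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high
```

Because the two masks partition the spectrum, `low + high` reconstructs the input up to floating-point error; the low band carries coarse structure and the high band carries edges and fine texture.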

Memory representation and retrieval in neuroscience and AI

MBZUAI ·

A Caltech researcher presented at MBZUAI on memory representation and retrieval, contrasting AI and neuroscience approaches. Current AI retrieval systems like RAG retrieve via fine-tuning and embedding similarity, while the presenter argued for exploring retrieval via combinatorial object identity or spatial proximity. The research explores circuit-level retrieval via domain fine-tuned LLMs and distributed memory for image retrieval using semantic similarity. Why it matters: The work suggests structured databases and retrieval-focused training can allow smaller models to outperform larger general-purpose models, offering efficiency gains for AI development in the region.
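For reference, retrieval by embedding similarity usually amounts to a nearest-neighbour search over vectors. The sketch below ranks stored embeddings by cosine similarity to a query; the function name `top_k_by_cosine` and the toy vectors are hypothetical and not taken from the talk.

```python
import numpy as np

def top_k_by_cosine(query, index, k=3):
    """Return indices and scores of the k rows of `index` most similar to `query`."""
    q = query / np.linalg.norm(query)
    rows = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = rows @ q                        # cosine similarity per stored vector
    order = np.argsort(-sims)[:k]          # highest similarity first
    return order, sims[order]

# Toy index of four 2-D embeddings (hypothetical data).
index = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
order, scores = top_k_by_cosine(np.array([1.0, 0.1]), index, k=2)
```

A lookup of this kind depends only on distances in embedding space, which is the behaviour the presenter contrasted with retrieval by object identity or spatial proximity.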