A new paper at ICCV 2025, co-authored by MBZUAI Ph.D. student Dmitry Demidov, introduces Dense-WebVid-CoVR, a 1.6-million-sample benchmark for composed video retrieval (CoVR). The benchmark features longer, context-rich descriptions and modification texts, generated using Gemini Pro and GPT-4o and then manually verified. The paper also presents a unified fusion approach that jointly reasons across video and text inputs, improving performance on fine-grained edit details. Why it matters: This work advances video search capabilities by enabling more human-like queries, which is crucial for creative and analytic workflows that require nuanced video retrieval.
A new approach to composed video retrieval (CoVR) leverages large multimodal models to infer the causal and temporal consequences implied by an edit. The method aligns the reasoned queries to candidate videos without task-specific finetuning. The work also introduces a new benchmark, CoVR-Reason, to evaluate reasoning in CoVR.
This article discusses retrieval augmentation in text generation, where information retrieved from an external source is used to condition predictions. It references recent work on retrieval-augmented image captioning, showing that model size can be greatly reduced when training data is available through retrieval. The author intends to continue this work in two directions: the intersection of retrieval augmentation and in-context learning, and controllable image captioning for language learning materials. Why it matters: This research direction has the potential to improve transfer learning in vision-language models, which could be especially relevant for downstream applications in Arabic NLP and multimodal tasks.
MBZUAI researchers presented a new approach to video question answering at ICCV 2023. The method leverages insights from analyzing still images to understand video content, potentially reducing the computational resources needed for training video question answering models. Guangyi Chen, Kun Zhang, and colleagues aim to apply pre-trained image models to understand video concepts. Why it matters: This research could lead to more efficient and accessible video analysis tools, benefiting fields like healthcare and security where video data is abundant.
Ivan Laptev from INRIA Paris presented a talk at MBZUAI on embodied multi-modal visual understanding, covering advancements in video understanding tasks like question answering and captioning. The talk highlighted recent work on vision-language navigation and manipulation. He argued that detailed understanding of the physical world through vision is still in its early stages and discussed open research directions related to robotics and video generation. Why it matters: The discussion of robotics applications and future research directions in embodied AI could influence the direction of AI research and development in the UAE, particularly at MBZUAI.
This paper introduces MOTOR, a multimodal retrieval and re-ranking approach for medical visual question answering (MedVQA) that uses grounded captions and optimal transport to capture relationships between queries and retrieved context, leveraging both textual and visual information. MOTOR identifies clinically relevant contexts to augment VLM input, achieving higher accuracy on MedVQA datasets. Empirical analysis shows MOTOR outperforms state-of-the-art methods by an average of 6.45%.
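For readers unfamiliar with optimal transport as a re-ranking signal, the general idea can be sketched as follows. This is a generic entropic optimal transport (Sinkhorn) toy in plain Python, not MOTOR's actual pipeline: the function names, the squared Euclidean cost, the uniform token weights, and the two-dimensional "token embeddings" are all assumptions made for illustration.

```python
import math

def cost_matrix(xs, ys):
    """Squared Euclidean distance between every pair of token vectors."""
    return [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]

def sinkhorn_cost(xs, ys, reg=0.1, n_iters=200):
    """Approximate transport cost between two token sets (uniform weights)."""
    C = cost_matrix(xs, ys)
    n, m = len(xs), len(ys)
    a = [1.0 / n] * n  # uniform mass on query tokens (assumption)
    b = [1.0 / m] * m  # uniform mass on context tokens (assumption)
    K = [[math.exp(-c / reg) for c in row] for row in C]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan is P[i][j] = u[i] * K[i][j] * v[j]; cost is <P, C>.
    return sum(u[i] * K[i][j] * v[j] * C[i][j] for i in range(n) for j in range(m))

def rerank_by_ot(query_tokens, candidates):
    """Order candidate contexts by ascending transport cost to the query."""
    scored = [(name, sinkhorn_cost(query_tokens, toks)) for name, toks in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical toy data: one candidate matches the query tokens, one does not.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "far": [[-1.0, 0.0], [0.0, -1.0]],
    "near": [[1.0, 0.0], [0.0, 1.0]],
}
ranked = rerank_by_ot(query_tokens, candidates)
```

A lower transport cost means the candidate's tokens can be matched to the query's tokens with less "movement", so the matching candidate is ranked first. A real system would likely use learned embeddings and a numerically stabilized solver.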
A Caltech researcher presented at MBZUAI on memory representation and retrieval, contrasting AI and neuroscience approaches. Current AI retrieval systems like RAG retrieve via fine-tuning and embedding similarity, while the presenter argued for exploring retrieval via combinatorial object identity or spatial proximity. The research explores circuit-level retrieval via domain fine-tuned LLMs and distributed memory for image retrieval using semantic similarity. Why it matters: The work suggests structured databases and retrieval-focused training can allow smaller models to outperform larger general-purpose models, offering efficiency gains for AI development in the region.
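The embedding-similarity retrieval that the talk contrasts against can be illustrated with a minimal sketch. It assumes cosine similarity over hand-written three-dimensional vectors; the function names and the toy corpus are hypothetical, and production systems typically use learned embeddings and an approximate nearest-neighbor index instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, corpus, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy corpus of document embeddings.
toy_corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.9, 0.1, 0.0],
}
top = retrieve_top_k([1.0, 0.0, 0.0], toy_corpus, k=2)
```

Here the query points in the same direction as `doc_a` and nearly the same direction as `doc_c`, so those two are returned ahead of the orthogonal `doc_b`. Retrieval by object identity or spatial proximity, as argued for in the talk, would replace this similarity score with a different lookup key.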