MBZUAI researchers have developed a new approach to improve the generalizability of vision-language models on out-of-distribution data. The study, led by Sheng Zhang with several MBZUAI professors and researchers, addresses the challenge that real-world AI applications must cope with unforeseen circumstances. The method aims to improve how these models, which combine natural language processing and computer vision, handle inputs that differ from their training data. Why it matters: Improving the adaptability of vision-language models is critical for real-world AI applications such as autonomous driving and medical imaging, especially in diverse and changing environments.
MBZUAI researchers presented EXAMS-V, a new benchmark dataset for evaluating the reasoning and processing abilities of vision-language models (VLMs). EXAMS-V contains over 20,000 multiple-choice questions across 26 subjects and 11 languages, including Arabic. The dataset presents each question within an image, testing a VLM's ability to integrate visual and textual information. Why it matters: This dataset fills a gap in VLM evaluation, providing a valuable resource for assessing and improving the multimodal reasoning capabilities of these models, particularly in diverse languages like Arabic.
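Because each question is delivered as an image, evaluation comes down to showing a VLM the question screenshot and parsing an option letter out of its free-form reply. The sketch below illustrates such a scoring loop; the `ExamItem` fields and the `ask_vlm` callable are illustrative placeholders, not the benchmark's actual interface.

```python
# Minimal sketch of scoring a vision-language model on an image-based
# multiple-choice benchmark such as EXAMS-V. The dataset fields and the
# `ask_vlm` callable are hypothetical placeholders, not EXAMS-V's real API.
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExamItem:
    image_path: str   # screenshot containing question text, options, and figures
    language: str     # e.g. "ar" for Arabic, "en" for English
    subject: str      # e.g. "physics", "history"
    answer: str       # gold option letter, e.g. "B"


PROMPT = (
    "The image contains a multiple-choice question. "
    "Answer with only the letter of the correct option (A, B, C, or D)."
)


def first_option_letter(text: str) -> str:
    """Extract the first standalone option letter from the model's reply."""
    match = re.search(r"\b([A-D])\b", text.upper())
    return match.group(1) if match else ""


def accuracy(items: List[ExamItem], ask_vlm: Callable[[str, str], str]) -> float:
    """ask_vlm(image_path, prompt) -> free-form text; any VLM backend works."""
    correct = sum(
        first_option_letter(ask_vlm(item.image_path, PROMPT)) == item.answer
        for item in items
    )
    return correct / len(items) if items else 0.0
```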
Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.
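The core design, as described in the paper, is to pool frame-level visual features along the spatial and temporal axes before projecting them into the LLM's token space. Below is a minimal sketch of that pooling step, assuming a frozen CLIP-style frame encoder; the class name, dimensions, and single linear projector are illustrative rather than the released implementation.

```python
# Minimal sketch of the spatio-temporal feature pooling idea behind
# Video-ChatGPT: per-frame patch features from a CLIP-style encoder are
# averaged over space and over time, concatenated, and linearly projected
# into the LLM's token-embedding space. Dimensions are illustrative.
import torch
import torch.nn as nn


class VideoFeaturePooler(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Single linear layer mapping pooled visual features to LLM tokens.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (T frames, N patches, vision_dim) from a frozen encoder.
        temporal_tokens = frame_features.mean(dim=1)  # (T, vision_dim): one token per frame
        spatial_tokens = frame_features.mean(dim=0)   # (N, vision_dim): one token per patch
        video_tokens = torch.cat([temporal_tokens, spatial_tokens], dim=0)
        return self.projector(video_tokens)           # (T + N, llm_dim): prepended to text tokens


# Example: 8 sampled frames, 256 patches each.
pooler = VideoFeaturePooler()
tokens = pooler(torch.randn(8, 256, 1024))
print(tokens.shape)  # torch.Size([264, 4096])
```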
MBZUAI researchers introduce UniMed-CLIP, a unified vision-language model (VLM) for diverse medical imaging modalities, trained on the new large-scale, open-source UniMed dataset. UniMed comprises over 5.3 million image-text pairs across six modalities (X-ray, CT, MRI, ultrasound, pathology, and fundus imaging), created by using LLMs to transform classification datasets into image-text format. In zero-shot evaluations, UniMed-CLIP significantly outperforms existing generalist VLMs and matches modality-specific medical VLMs, improving over BiomedCLIP by +12.61 on average across 21 datasets while using 3x less training data.
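Zero-shot evaluation of a CLIP-style model amounts to matching an image embedding against text embeddings of class-name prompts and picking the closest class. The sketch below shows that matching step; the prompt template and placeholder encoders are assumptions for illustration, not UniMed-CLIP's released code.

```python
# Minimal sketch of CLIP-style zero-shot classification, the evaluation setup
# used to compare UniMed-CLIP against other medical VLMs. The encoder
# function is a placeholder for whatever image/text towers the model provides.
import numpy as np


def zero_shot_predict(image_embedding, class_names, encode_text):
    """Pick the class whose prompt embedding is most similar to the image."""
    # Modality-aware prompt templates; the wording here is illustrative.
    prompts = [f"a chest X-ray showing {name}" for name in class_names]
    text_embeddings = np.stack([encode_text(p) for p in prompts])

    # Cosine similarity between the image and every class prompt.
    image_embedding = image_embedding / np.linalg.norm(image_embedding)
    text_embeddings = text_embeddings / np.linalg.norm(
        text_embeddings, axis=1, keepdims=True
    )
    scores = text_embeddings @ image_embedding
    return class_names[int(np.argmax(scores))]


# Toy usage with a random "encoder" standing in for the real text tower.
rng = np.random.default_rng(0)
fake_encode_text = lambda prompt: rng.normal(size=512)
print(zero_shot_predict(rng.normal(size=512), ["pneumonia", "no finding"], fake_encode_text))
```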
Researchers at MBZUAI, IBM Research, and other institutions have developed EarthDial, a new vision-language model (VLM) designed specifically to process geospatial data from remote sensing technologies. EarthDial handles data in multiple modalities and resolutions, and it processes images captured at different times to observe environmental changes. The model outperformed existing models on more than 40 tasks, including image classification, object detection, and change detection. Why it matters: This unified model bridges the gap between generic VLMs and domain-specific models, enabling complex geospatial data analysis for applications like disaster assessment and climate monitoring in the region.
A presentation discusses the evolution of Vision-and-Language Navigation (VLN) from benchmarks like Room-to-Room (R2R). It highlights the role of large language models (LLMs) such as GPT-4 in enabling more natural human-machine interactions. The presentation showcases work using LLMs to decode navigational instructions and improve robotic navigation. Why it matters: This research demonstrates the potential of merging vision, language, and robotics for advanced AI applications in navigation and human-computer interaction.
This paper introduces MOTOR, a multimodal retrieval and re-ranking approach for medical visual question answering (MedVQA) that uses grounded captions and optimal transport to capture relationships between queries and retrieved context, leveraging both textual and visual information. MOTOR identifies clinically relevant contexts to augment VLM input, achieving higher accuracy on MedVQA datasets. Empirical analysis shows MOTOR outperforms state-of-the-art methods by an average of 6.45%.
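As a rough illustration of the optimal-transport idea, each retrieved context can be scored by the entropic OT (Sinkhorn) cost between its feature tokens and the query's feature tokens, and the lowest-cost contexts are kept. The sketch below implements that scoring with a standard Sinkhorn routine; the cosine ground cost, hyperparameters, and feature shapes are illustrative assumptions, not MOTOR's actual configuration.

```python
# Minimal sketch of optimal-transport-based re-ranking in the spirit of MOTOR:
# each retrieved context is scored by the entropic OT cost between its feature
# tokens and the query's feature tokens. The Sinkhorn routine and cosine cost
# are standard; shapes and hyperparameters are illustrative.
import numpy as np


def sinkhorn_cost(query_feats, ctx_feats, eps=0.1, n_iters=100):
    """Entropic OT cost between two token sets under a cosine-distance ground cost."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    c = ctx_feats / np.linalg.norm(ctx_feats, axis=1, keepdims=True)
    cost = 1.0 - q @ c.T                              # (n_query_tokens, n_ctx_tokens)

    a = np.full(cost.shape[0], 1.0 / cost.shape[0])   # uniform marginals
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])
    K = np.exp(-cost / eps)

    u = np.ones_like(a)
    for _ in range(n_iters):                          # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]                # transport plan
    return float((plan * cost).sum())


def rerank(query_feats, candidates, top_k=3):
    """Return the top_k retrieved contexts with the lowest OT cost to the query."""
    scored = sorted(candidates, key=lambda feats: sinkhorn_cost(query_feats, feats))
    return scored[:top_k]


# Toy usage: a query with 4 token features, 10 candidate contexts with 6 tokens each.
rng = np.random.default_rng(0)
best = rerank(rng.normal(size=(4, 64)), [rng.normal(size=(6, 64)) for _ in range(10)])
print(len(best), best[0].shape)
```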